Abstract

We analyze general model selection procedures using penalized empirical loss minimization
under computational constraints. While classical model selection approaches do not consider
computational aspects of performing model selection, we argue that any practical model
selection procedure must not only trade off estimation and approximation error, but also the
computational effort required to compute empirical minimizers for different function classes.
We provide a framework for analyzing such problems, and we give algorithms for model
selection under a computational budget. These algorithms satisfy oracle inequalities that show that
the risk of the selected model is not much worse than if we had devoted all of our computational
budget to the optimal function class.



1 Introduction 

In decision-theoretic statistical settings, one receives samples $\{z_1, \ldots, z_n\} \subset \mathcal{Z}$ drawn i.i.d. from
some unknown distribution $P$ over a sample space $\mathcal{Z}$, and given a loss function $\ell$, seeks a function
$f$ to minimize the risk

$R(f) := \mathbb{E}[\ell(z, f)]$. (1)

Since $R(f)$ is unknown, the typical approach is to compute estimates based on the empirical risk,
$\hat{R}_n(f) := \frac{1}{n} \sum_{i=1}^n \ell(z_i, f)$, over a function class $\mathcal{F}$. Through this, one seeks a function $f_n$ with a risk
close to the Bayes risk, the minimal risk over all measurable functions, which is $R_0 := \inf_f R(f)$.
There is a natural tradeoff based on the class $\mathcal{F}$ one chooses, since

$R(f_n) - R_0 = \left( R(f_n) - \inf_{f \in \mathcal{F}} R(f) \right) + \left( \inf_{f \in \mathcal{F}} R(f) - R_0 \right)$,

which decomposes the excess risk of $f_n$ into estimation error (left) and approximation error (right).
A common approach to addressing this tradeoff is to express $\mathcal{F}$ as a union of classes

$\mathcal{F} = \bigcup_{j} \mathcal{F}_j$. (2)

The model selection problem is to choose a class $\mathcal{F}_i$ and a function $f \in \mathcal{F}_i$ that give the best tradeoff
between estimation error and approximation error. A standard approach to the model selection
problem is the now classical idea of complexity regularization, which arose out of early works by
Mallows [21] and Akaike [1]. The complexity regularization approach balances two competing
objectives: the minimum empirical risk of a model class $\mathcal{F}_i$ (approximation error) and a complexity
penalty (to control estimation error) for the class. Different choices of the complexity penalty
give rise to different model selection criteria and algorithms (for example, see the lecture notes by
Massart [2^] and the references therein). The complexity regularization approach uses penalties
$\gamma_i : \mathbb{N} \to \mathbb{R}_+$ associated with each class $\mathcal{F}_i$ to perform model selection, where $\gamma_i(n)$ is a complexity
penalty for class $i$ when $n$ samples are available; usually the functions $\gamma_i$ decrease to zero in $n$ and
increase in the index $i$. The actual algorithm is as follows: for each $i$, choose

$f_i \in \operatorname{argmin}_{f \in \mathcal{F}_i} \hat{R}_n(f)$ and select $f_n = \operatorname{argmin}_{i = 1, 2, \ldots} \left\{ \hat{R}_n(f_i) + \gamma_i(n) \right\}$ (3)

as the output of the model selection procedure, where $\hat{R}_n$ denotes the $n$-sample empirical risk.
Results of several authors U, 2(3, 2^ show that with appropriate penalties $\gamma_i$ and given a dataset
of size $n$, the output $f_n$ of the procedure roughly satisfies

$\mathbb{E} R(f_n) - R_0 \le \min_{i} \left[ \inf_{f \in \mathcal{F}_i} R(f) - R_0 + \gamma_i(n) \right]$. (4)

Several approaches to complexity regularization are possible, and an incomplete bibliography in- 
cludes the papers @, [H, B i, Q £} . 
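As a minimal sketch of the selection step in procedure (3), assume the per-class empirical risks have already been computed. The function name, the risk values, and the square-root penalty below are hypothetical choices for illustration only and are not taken from the references.

```python
import math


def select_by_complexity_regularization(empirical_risks, penalties):
    """Return (index, score) minimizing empirical risk plus complexity penalty.

    empirical_risks[i] plays the role of the empirical risk of the fitted
    function in class i, and penalties[i] the role of gamma_i(n) in (3).
    """
    scores = [risk + pen for risk, pen in zip(empirical_risks, penalties)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical numbers: richer classes fit better but pay a larger penalty.
n = 400
empirical_risks = [0.30, 0.18, 0.15, 0.14]
penalties = [math.sqrt(d / n) for d in (1, 4, 16, 64)]  # illustrative sqrt(d/n) penalty

best, score = select_by_complexity_regularization(empirical_risks, penalties)
print(best, round(score, 2))
```

With these illustrative numbers the second class (index 1) wins: its fit is slightly worse than that of the richer classes, but its penalty is much smaller.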



Oracle inequalities of the form (4) show that, for a given sample size, complexity regularization
procedures trade off the approximation and estimation errors, often optimally [23]. A drawback of
the above approaches is that in order to provide guarantees on the result of the model selection 
procedure, one needs to be able to optimize over each model in the hierarchy (that is, compute 
the estimates for each i). This is reasonable when the sample size n is the key limitation, and 
it is computationally feasible when n is small and the samples z are low-dimensional. However, 
the cost of fitting a large number of model classes on a large, high-dimensional dataset can be pro- 
hibitive; such data is common in modern statistical settings. In such cases, it is the computational 
resources — rather than the sample size — that form the key inferential bottleneck. In this paper, 
we consider model selection from this computational perspective, viewing the amount of computa- 
tion, rather than the sample size, as the quantity whose effects on estimation we must understand. 
Specifically, we study model selection methods that work within a given computational budget. 

An interesting and difficult aspect of the problem that we must address is the interaction 
between model class complexity and computation time. It is natural to assume that for a fixed 
sample size, it is more expensive to estimate a model from a complex class than a simple class. 
Put inversely, given a computational bound, a simple model class can fit a model to a much larger 
sample size than a rich model class. So any strategy for model selection under a computational 
constraint should trade off two criteria: (i) the relative training cost of different model classes, 
which allows simpler classes to receive far more data (thus making them resilient to overfitting) , 
and (ii) lower approximation error in the more complex model classes. 

In addressing these computational and statistical issues, this paper makes two main contribu- 
tions. First, we propose a novel computational perspective on the model selection problem, which 
we believe should be a natural consideration in statistical learning problems. Secondly, within 
this framework, we provide algorithms for model selection in many different scenarios, and provide 
oracle inequalities on their estimates under different assumptions. Our first two results address the
case where we have a model hierarchy that is ordered by inclusion, that is, $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \mathcal{F}_3 \subseteq \cdots$.
The first result provides an inequality that is competitive with an oracle knowing the optimal class,
incurring at most an additional logarithmic penalty in the computational budget. The second result 
extends our approach to obtaining faster rates for model selection under conditions that guaran- 
tee sharper concentration results for empirical risk minimization procedures; oracle inequalities 
under these conditions, but without computational constraints, have been obtained, for example, 
by Bartlett 0] and Koltchinskii [3]. Both of our results refine existing complexity-regularized risk 
minimization techniques by a careful consideration of the structure of the problem. Our third re- 
sult applies to model classes that do not necessarily share any common structure. Here we present 
a novel algorithm — exploiting techniques for multi-armed bandit problems — that uses confidence 
bounds based on concentration inequalities to select a good model under a given computational 
budget. We also prove a minimax optimal oracle inequality on the performance of the selected 
model. All of our algorithms are computationally simple and efficient. 

The remainder of this paper is organized as follows. We begin in Section 2 by formalizing our
setting for a nested hierarchy of models, providing an estimator and oracle inequalities for the
model selection problem. In Section 3 we refine our estimator and its analysis to obtain fast rates
for model selection under some additional reasonable (standard) conditions. We study the setting
of unstructured model collections in Section 4. Detailed technical arguments and various auxiliary
results needed to establish our main theorems and corollaries can be found in the appendices.



2 Model selection over nested hierarchies 

In many practical scenarios, the family of models with which one works has some structure. One
of the most common model selection settings has the model classes $\mathcal{F}_i$ ordered by inclusion with
increasing complexity (e.g. [11]). In this section, we study such model selection problems; we begin
by formally stating our assumptions and giving a few natural examples, proceeding thereafter to
oracle inequalities for a computationally efficient model selection procedure.

2.1 Assumptions 

Our first main assumption is a natural inclusion assumption, which is perhaps the most common
assumption in prior work on model selection (e.g. [11, 20]):

Assumption A. The function classes $\mathcal{F}_i$ are ordered by inclusion:

$\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \mathcal{F}_3 \subseteq \cdots$ (5)

We provide two examples of such problems in the next section. In addition to the inclusion
assumption, we make a few assumptions on the computational aspects of the problem. Most algorithms
used in the framework of complexity regularization rely on the computation of estimators of the
form

$f_i = \operatorname{argmin}_{f \in \mathcal{F}_i} \hat{R}_n(f)$, (6)

either exactly or approximately, for each class $i$. Since the model classes are ordered by inclusion,
it is natural to assume that the computational cost of computing an empirical risk minimizer from
$\mathcal{F}_i$ is higher than that for a class $\mathcal{F}_j$ when $i > j$. Said differently, given a fixed computational
budget $T$, it may be impossible to use as many samples to compute an estimator from $\mathcal{F}_i$ as it is
to compute an estimator from $\mathcal{F}_j$ (again, when $i > j$). We formalize this in the next assumption,
which is stated in terms of an (arbitrary) algorithm $\mathcal{A}$ that selects functions $f \in \mathcal{F}_i$ for each index
$i$ based on a set of $n_i$ samples.

Assumption B. Given a computational budget $T$, there is a sequence $\{n_i(T)\}_i \subset \mathbb{N}$ such that

(a) $n_i(T) \ge n_j(T)$ for $i < j$.

(b) The complexity penalties $\gamma_i$ satisfy $\gamma_i(n_i(T)) \le \gamma_j(n_j(T))$ for $i < j$.

(c) For each class $\mathcal{F}_i$, the computational cost of using the algorithm $\mathcal{A}$ with $n_i(T)$ samples is $T$.
That is, estimation within class $\mathcal{F}_i$ using $n_i(T)$ samples has the same computational complexity
for each $i$.

(d) For all $i$, the output $\mathcal{A}(i, T)$ of the algorithm $\mathcal{A}$, given a computational budget $T$, satisfies

$\hat{R}_{n_i(T)}(\mathcal{A}(i, T)) - \inf_{f \in \mathcal{F}_i} \hat{R}_{n_i(T)}(f) \le \gamma_i(n_i(T))$.

(e) As $i \uparrow \infty$, $\gamma_i(n) \to \infty$ for any fixed $n$.



The first two assumptions formalize a natural notion of computational budget in the context
of our model selection problem: given equal computation time, a simpler model can be fit using a
larger number of samples than a complex model. Assumption B(c) says that the number of samples
$n_i(T)$ is chosen to roughly equate the computational complexity of estimation within each class.
Assumption B(d) simply states that we compute approximate empirical minimizers for each class
$\mathcal{F}_i$. Our choice of the accuracy of computation to be $\gamma_i$ in part (d) is done mainly for notational
convenience in the statements of our results; one could use an alternate constant or function and
achieve similar results. Finally, part (e) rules out degenerate cases where the penalty function
asymptotes to a finite upper bound, and this assumption is required for our estimator to be
well-defined for infinite model hierarchies. In the sequel, we use the shorthand $\gamma_i(T)$ to denote $\gamma_i(n_i(T))$
when the number of samples $n_i(T)$ is clear from context.

Certainly many choices are possible for the penalty functions $\gamma_i$, and work studying appropriate
penalties is classical (see e.g. [2, El)]). Our focus in this paper is on complexity estimates derived from
concentration inequalities, which have been extensively studied by a number of researchers [Hi, 23,
18]. Such complexity estimates are convenient since they ensure that the penalized empirical
risk bounds the true risk with high probability. Formally, we have

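Assumption C. For all $\epsilon > 0$ and for each $i$, there are constants $\kappa_1, \kappa_2 > 0$ such that for any
budget $T$ the output $\mathcal{A}(i, T) \in \mathcal{F}_i$ satisfies

$\mathbb{P}\left( |\hat{R}_{n_i(T)}(\mathcal{A}(i, T)) - R(\mathcal{A}(i, T))| > \gamma_i(T) + \kappa_2 \epsilon \right) \le \kappa_1 \exp(-4 n_i(T) \epsilon^2)$. (7)

In addition, for any fixed function $f \in \mathcal{F}_i$, $\mathbb{P}\left( |\hat{R}_{n_i(T)}(f) - R(f)| > \kappa_2 \epsilon \right) \le \kappa_1 \exp(-4 n_i(T) \epsilon^2)$.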
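To see how the tail bound (7) turns into a high-probability statement, the following short calculation inverts it for a target confidence level. It is only a worked substitution; $m > 0$ is simply a name for the exponent of the target failure probability.

```latex
\[
\epsilon = \frac{1}{2}\sqrt{\frac{m}{n_i(T)}}
\quad\Longrightarrow\quad
\kappa_1 \exp\bigl(-4\, n_i(T)\, \epsilon^2\bigr) = \kappa_1 e^{-m},
\]
so that, with probability at least $1 - \kappa_1 e^{-m}$,
\[
\bigl|\hat{R}_{n_i(T)}(\mathcal{A}(i,T)) - R(\mathcal{A}(i,T))\bigr|
\le \gamma_i(T) + \frac{\kappa_2}{2}\sqrt{\frac{m}{n_i(T)}} .
\]
```

The deviation therefore shrinks at the rate $1/\sqrt{n_i(T)}$, whatever the form of the penalty $\gamma_i$.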
2.2 Some illustrative examples

We now provide two concrete examples to illustrate Assumptions A-C.

Example 1 (Linear classification with nested balls). In a classification problem, each sample $z_i$
consists of a covariate vector $x \in \mathbb{R}^d$ and label $y \in \{-1, +1\}$. In margin-based linear classification,
the predictions are the sign of the linear function $f_\theta(x) = \langle \theta, x \rangle$, where $\theta \in \mathbb{R}^d$. A natural sequence
of model classes is sets $\{f_\theta\}$ indexed via norm-balls of increasing radii: $\mathcal{F}_i = \{f_\theta : \theta \in \mathbb{R}^d, \|\theta\|_2 \le r_i\}$,
where $0 \le r_1 < r_2 < \cdots$. By inspection, $\mathcal{F}_i \subset \mathcal{F}_{i+1}$, so that this sequence satisfies Assumption A.

The empirical and expected risks of a function $f_\theta$ are often measured using the sample
average and expectation, respectively, of a convex upper bound on the 0-1 loss $\mathbb{1}(y f_\theta(x) \le 0)$.
Examples of such losses include the hinge loss, $\ell(y f_\theta(x)) = \max(0, 1 - y f_\theta(x))$, or the logistic loss,
$\ell(y f_\theta(x)) = \log(1 + \exp(-y f_\theta(x)))$. Assume that $\mathbb{E}[\|x\|_2^2] \le X^2$ and let $\sigma_j$ be independent uniform
$\{\pm 1\}$-valued random variables. Then we may use a penalty function $\gamma_i$ based on the Rademacher
complexity $\mathfrak{R}_n(\mathcal{F}_i)$ of the class $i$,

$\mathfrak{R}_n(\mathcal{F}_i) := \mathbb{E}\left[ \sup_{f \in \mathcal{F}_i} \left| \frac{2}{n} \sum_{j=1}^n \sigma_j f(x_j) \right| \right] \le \frac{2 r_i X}{\sqrt{n}}$.

Setting $\gamma_i$ to be the Rademacher complexity $\mathfrak{R}_n(\mathcal{F}_i)$ satisfies the conditions of Assumption C for
both the logistic and the hinge losses, which are 1-Lipschitz. Hence, using the standard Lipschitz
contraction bound [M, Theorem 12], we may take $\gamma_i(T) = \frac{2 r_i X}{\sqrt{n_i(T)}}$.

To illustrate Assumption B, we take stochastic gradient descent [26] as an example. Assuming
that the computation time to process a sample $z$ is equal to the dimension $d$, then Nemirovski et
al. [24] show that the computation time required by this algorithm to output a function $f = \mathcal{A}(i, T)$
satisfying Assumption B(d) (that is, a $\gamma_i$-optimal empirical minimizer) is at most

$\frac{4 r_i^2 X^2}{\gamma_i^2(T)} \cdot d$.

Substituting the bound on $\gamma_i(T)$ above, we see that the computational time for class $i$ is at most
$d \, n_i(T)$. In other words, given a computational time $T$, we can satisfy Assumption B by setting
$n_i(T) \propto T/d$ for each class $i$; the number of samples remains constant across the hierarchy in this
example.
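The substitution in Example 1 can be checked numerically. The sketch below assumes the penalty $2 r_i X / \sqrt{n}$ and the time bound $4 r_i^2 X^2 d / \gamma_i^2$ quoted in the text; the function names and the values of `d`, `X`, `T` and the radii are hypothetical.

```python
import math


def rademacher_penalty(r_i, X, n):
    """Penalty gamma_i = 2 * r_i * X / sqrt(n) from Example 1."""
    return 2.0 * r_i * X / math.sqrt(n)


def sgd_time_bound(r_i, X, gamma, d):
    """Time bound 4 * r_i^2 * X^2 / gamma^2 * d for a gamma-optimal minimizer."""
    return 4.0 * r_i ** 2 * X ** 2 / gamma ** 2 * d


def samples_for_budget(T, d):
    """n_i(T) = T / d: each processed sample costs d time units (constant 1 assumed)."""
    return T // d


d, X, T = 50, 1.0, 100_000        # hypothetical dimension, norm bound, budget
n = samples_for_budget(T, d)      # the same sample size for every radius

costs = []
for r_i in (0.5, 1.0, 2.0, 4.0):  # hypothetical radii r_1 < r_2 < ...
    gamma = rademacher_penalty(r_i, X, n)
    costs.append(sgd_time_bound(r_i, X, gamma, d))

print(n, costs)
```

Every entry of `costs` equals `d * n`, independent of the radius, which is why the sample size stays constant across this hierarchy while the penalty grows with $r_i$.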

Example 2 (Linear classification in increasing dimensions). Staying within the linear classification
domain, we index the complexity of the model classes $\mathcal{F}_i$ by an increasing sequence of dimensions
$\{d_i\} \subset \mathbb{N}$. Formally, we set

$\mathcal{F}_i = \{f_\theta : \theta_j = 0 \text{ for } j > d_i, \ \|\theta\|_2 \le r_i\}$,

where $0 \le r_1 < r_2 < \cdots$. This structure captures a variable selection problem where we have a
prior ordering on the covariates.

In special scenarios, such as when the design matrix $X = [x_1 \ x_2 \ \cdots \ x_n]$ satisfies certain
incoherence or irrepresentability assumptions [12], variable selection can be performed using
$\ell_1$-regularization or related methods. However, in general an oracle inequality for variable selection
requires some form of exhaustive search over subsets. In the sequel, we show that in this simpler
setting of variable selection over nested subsets, we can provide oracle inequalities without
computing an estimator for each subset and without any assumptions on the design matrix $X$.
For this function hierarchy, we consider complexity penalties arising from VC-dimension
arguments [27, 9], in which case we may set

$\gamma_i(T) = \sqrt{\frac{d_i}{n_i(T)}}$,

which satisfies Assumption C. Using arguments similar to those for Example 1, we may conclude
that the computational assumption B can be satisfied for this hierarchy, where the algorithm $\mathcal{A}$
requires time $d_i n_i(T)$ to select $f \in \mathcal{F}_i$. Thus, given a computational budget $T$, we set the number
of samples $n_i(T)$ for class $i$ to be proportional to $T/d_i$.
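To make the bookkeeping of Example 2 concrete, the sketch below tabulates sample sizes and penalties for a hypothetical list of dimensions. The proportionality constant in $n_i(T) \propto T/d_i$ is taken to be 1, which is an assumption made only for illustration.

```python
import math


def example2_schedule(T, dims):
    """Return (d_i, n_i(T), gamma_i(T)) for each class, with n_i(T) = T // d_i
    and gamma_i(T) = sqrt(d_i / n_i(T)) as in Example 2."""
    rows = []
    for d_i in dims:
        n_i = T // d_i                   # samples affordable within budget T
        gamma_i = math.sqrt(d_i / n_i)   # VC-type penalty
        rows.append((d_i, n_i, gamma_i))
    return rows


schedule = example2_schedule(T=10_000, dims=[1, 4, 16, 64])  # hypothetical values
for d_i, n_i, gamma_i in schedule:
    print(d_i, n_i, round(gamma_i, 4))
```

Richer classes receive fewer samples and carry larger penalties, which is the monotonicity required by parts (a) and (b) of Assumption B.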

We provide only classification examples above since they demonstrate the essential aspects of 
our formulation. Similar quantities can also be obtained for a variety of other problems, such 
as parametric and non-parametric regression, and for a number of model hierarchies including 
polynomial or Fourier expansions, wavelets, or Sobolev classes, among others (for more instances, 
see, e.g. QSQl). 

2.3 The computationally-aware model selection algorithm 

Having specified our assumptions and given examples satisfying them, we turn to describing our
first computationally-aware model selection algorithm. Let us begin with the simpler scenario where
we have only $K$ model classes (we extend this to infinite classes below). Perhaps the most obvious
computationally budgeted model selection procedure is the following: allocate a budget of $T/K$
to each model class $i$. As a result, class $i$'s estimator $f_i = \mathcal{A}(i, T/K)$ is computed using $n_i(T/K)$
samples. Let $f_n$ denote the output of the basic model selection algorithm (3) with the choices
$n = n_i(T/K)$, using $n_i(T/K)$ samples to evaluate the empirical risk for class $i$, and replacing the penalty
$\gamma_i(n)$ with $\gamma_i(n) + \sqrt{\log i / n}$. Then very slight modifications of standard arguments [23]
yield the oracle inequality

$R(f_n) \le \min_{i = 1, \ldots, K} \left\{ R_i^* + c \, \gamma_i\left(\frac{T}{K}\right) + \sqrt{\frac{\log i}{n_i(T/K)}} \right\}$

with high probability, where $c$ is a universal constant and $R_i^* := \inf_{f \in \mathcal{F}_i} R(f)$ denotes the
minimal risk within class $i$. This approach can be quite poor. For
instance, in Example 2 we have $n_i(T/K) = T/(K d_i)$, and the above inequality incurs a penalty
that grows as $\sqrt{K}$. This is much worse than the logarithmic scaling in $K$ that is typically possible
in computationally unconstrained settings [11]. It is thus natural to ask whether we can use the
nested structure of our model hierarchy to allocate computational budget more efficiently.
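The $\sqrt{K}$ growth of the naive scheme is easy to reproduce. The sketch below assumes the Example 2 quantities $n_i(T/K) = T/(K d_i)$ and $\gamma_i = \sqrt{d_i / n_i}$; the budget and dimension are hypothetical.

```python
import math


def naive_penalty(d_i, T, K):
    """Penalty of class i when the budget T is split evenly over K classes."""
    n_i = T / (K * d_i)          # samples for class i under budget T / K
    return math.sqrt(d_i / n_i)  # equals d_i * sqrt(K / T)


T, d_i = 1_000_000, 10           # hypothetical budget and dimension
ratios = {K: naive_penalty(d_i, T, K) / naive_penalty(d_i, T, 1)
          for K in (1, 4, 16, 64)}
print(ratios)
```

Relative to spending the whole budget on one class, the penalty inflates by a factor of $\sqrt{K}$ (here 1, 2, 4 and 8).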

To answer this question, we introduce the notion of coarse-grid sets, which use the growth
structure of the complexity penalties $\gamma_i$ to construct a scheme for allocating the budget across the
hierarchy. Recall the constant $\kappa_2$ from Assumption C and let $m > 0$ be an arbitrary constant (we
will see that $m$ controls the probability of error in our results). Given $s \in \mathbb{N}$ ($s \ge 1$), we define

$\bar{\gamma}_i(T, s) := 2 \gamma_i\left(\frac{T}{s}\right) + \kappa_2 \sqrt{\frac{2(m + \log s)}{n_i(T/s)}}$. (8)

Notice that, to simplify the notation, we hide the dependence of $\bar{\gamma}_i$ on $m$. With the definition (8),
we now give a definition characterizing the growth characteristics of the penalties and sample sizes.
Figure 1: Construction of the coarse-grid set $S_\lambda$. The X-axis is the class index $i$, and the Y-axis
represents the corresponding complexity $\gamma_i(T)$. When the penalty function grows steeply early
on, we include a large number of models. The number of complex models included in $S_\lambda$ can be
significantly smaller as the growth of the penalty function tapers out.

Definition 1. Given a budget $T$, for a set $S \subseteq \mathbb{N}$, we say that $S$ satisfies the coarse grid condition
with parameters $\lambda$, $m$, and $s$ if $|S| = s$ and for each $i$ there is an index $j \in S$ such that

$\bar{\gamma}_i(T, s) \le \bar{\gamma}_j(T, s) \le (1 + \lambda) \bar{\gamma}_i(T, s)$. (9)

Figure 1 gives an illustration of the coarse-grid set. For simplicity in presentation, we set $\lambda = 1$ in
the statements of our results in the sequel.

If the coarse-grid set is finite and, say, $|S| = s$, then the set $S$ presents a natural collection of
indices over which to perform model selection. We simply split the budget uniformly amongst the
coarse-grid set $S$, giving budget $T/s$ to each class in the set. Indeed, the main theorem of this
section shows that for a large class of problems, it always suffices to restrict our attention to a finite
grid set $S$, allowing us to present both a computationally tractable estimator and a good oracle
inequality for the estimator. In some cases, there may be no finite coarse grid set. Thus we look
for a way to restrict our selection to finite sets, which we can do with the following assumption (the
assumption is unnecessary if the hierarchy is finite).

Assumption D. (a) There is a constant $B < \infty$ such that $R_1^* \le B$.

(b) For all $n \in \mathbb{N}$, the penalty function satisfies $\gamma_1(n) \ge 1/n$.

Assumption D(a) is satisfied, for example, if the loss function is bounded, or even if there is a
function $f \in \mathcal{F}_1$ with finite risk. Assumption D(b) also is mild; unless the class $\mathcal{F}_1$ is trivial, in
general classes satisfying Assumption C have $\gamma_1(n) = \Omega(1/\sqrt{n})$.

Under these assumptions, we provide our computationally budgeted model selection procedure
in Algorithm 1. We will see in the proof of Theorem 1 below that the assumptions ensure that we
can build a coarse grid of size

$s = \lceil \log_2(1 + B n_1(T)) \rceil + 2$.

In particular, Assumption B(e) ensures that the complexity penalties continue to increase with the
class index $i$. Hence, there is a class $K$ such that the complexity penalty $\gamma_K$ is larger than the
penalized risk of the smallest class $\mathcal{F}_1$, at which point no class larger than $K$ can be a minimizer
in the oracle inequality. The above choice of $s$ ensures that there is at least one class $j \in S$ so that
$j \ge K$, allowing us to restrict our attention only to the function classes $\{\mathcal{F}_i \mid i \in S\}$.

Algorithm 1 Computationally budgeted model selection over nested hierarchies

Input: Model hierarchy $\{\mathcal{F}_i\}$ with corresponding penalty functions $\gamma_i$, computational budget $T$,
upper bound $B$ on the minimum risk of class 1, and confidence parameter $m$.

Construction of the coarse-grid set $S$:
    Set $s = \lceil \log_2(1 + B n_1(T)) \rceil + 2$.
    for $k = 0$ to $s - 1$ do
        Set $j_{k+1}$ to be the largest class $j$ for which $\gamma_j(T/s) \le 2^k \gamma_1(T/s)$.
    end for
    Set $S = \{j_k : k = 1, \ldots, s\}$.
Model selection estimate:
    Set $f_i = \mathcal{A}(i, T/s)$ for $i \in S$.
    Select a class $\hat{i}$ that satisfies

    $\hat{i} \in \operatorname{argmin}_{i \in S} \left\{ \hat{R}_{n_i(T/s)}(f_i) + \gamma_i(T/s) + \frac{\kappa_2}{2} \sqrt{\frac{m}{n_i(T/s)}} + \frac{\kappa_2}{2} \sqrt{\frac{\log s}{n_i(T/s)}} \right\}$. (10)

    Output the function $f = f_{\hat{i}} = \mathcal{A}(\hat{i}, T/s)$.
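The steps of Algorithm 1 can be sketched in a few lines for a finite hierarchy. This is an illustration under stated assumptions rather than the authors' implementation: `fit`, `gamma` and `n` are hypothetical caller-supplied functions standing in for the algorithm $\mathcal{A}$, the penalties and the sample sizes, the deviation terms in the score are taken with constant $\kappa_2 / 2$, and the toy instance at the bottom uses made-up numbers in the spirit of Example 2 with $d_i = i$.

```python
import math


def coarse_grid(gamma, n, T, B, num_classes):
    """Coarse-grid set (lambda = 1) for classes 1, ..., num_classes."""
    s = math.ceil(math.log2(1 + B * n(1, T))) + 2
    base = gamma(1, T / s)
    grid = set()
    for k in range(s):
        # j_{k+1}: largest class whose penalty is at most 2^k * gamma_1(T/s).
        grid.add(max(j for j in range(1, num_classes + 1)
                     if gamma(j, T / s) <= 2 ** k * base))
    return s, sorted(grid)


def budgeted_model_selection(fit, gamma, n, T, B, m, kappa2, num_classes):
    """Give budget T/s to each grid class and return the class minimizing the
    penalized empirical risk."""
    s, grid = coarse_grid(gamma, n, T, B, num_classes)
    best_score, best_class, best_model = math.inf, None, None
    for i in grid:
        model, empirical_risk = fit(i, T / s)  # stands in for A(i, T/s)
        n_i = n(i, T / s)
        score = (empirical_risk + gamma(i, T / s)
                 + 0.5 * kappa2 * math.sqrt(m / n_i)
                 + 0.5 * kappa2 * math.sqrt(math.log(s) / n_i))
        if score < best_score:
            best_score, best_class, best_model = score, i, model
    return best_class, best_model, s, grid


# Toy instance (all numbers hypothetical), ignoring integer rounding of n_i.
def n(i, budget):
    return budget / i                # samples affordable for class i


def gamma(i, budget):
    return i / math.sqrt(budget)     # sqrt(d_i / n_i(budget)) with d_i = i


def fit(i, budget):
    return f"model-{i}", 1.0 / i     # placeholder for training: (model, empirical risk)


chosen, model, s, grid = budgeted_model_selection(
    fit, gamma, n, T=10_000, B=1.0, m=2.0, kappa2=1.0, num_classes=50)
print(s, grid, chosen)
```

In this toy run only 7 of the 50 classes are ever fitted: the grid indices double until they hit the end of the hierarchy, so the number of fitted classes grows logarithmically rather than linearly.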

2.4 Main result and some consequences 

With the above definitions in place, we can now provide an oracle inequality on the performance of
the model selected by Algorithm 1. We start with our main theorem, and then provide corollaries
to help explain various aspects of it.

Theorem 1. Let $f = \mathcal{A}(\hat{i}, T/s)$ be the output of the algorithm $\mathcal{A}$ for the class $\hat{i}$ specified by the
procedure (10). Let Assumptions A-D be satisfied. With probability at least $1 - 2\kappa_1 \exp(-m)$,

$R(f) \le \min_{i = 1, 2, 3, \ldots} \left\{ R_i^* + 4 \gamma_i\left(\frac{T}{s}\right) + \kappa_2 \sqrt{\frac{8(m + \log s)}{n_i(T/s)}} \right\}$. (11)

Furthermore, if $n_1(T) = O(T)$ then $s = O(\log T)$.

The assumption that $n_1(T)$ is linear is mild: unless $\mathcal{F}_1$ is trivial, any algorithm for $\mathcal{F}_1$ must at
least observe the data, and hence must use computation at least linear in the sample size.
Remarks: To better understand the result of Theorem 1, we turn to a few brief remarks.
(a) We may ask what an omniscient oracle with access to the same computational algorithm $\mathcal{A}$
could do. Such an oracle would know the optimal class $i^*$ and allocate the entire budget $T$ to
compute $\mathcal{A}(i^*, T)$. By Assumption C, the output $f$ of this oracle satisfies, with probability at
least $1 - \kappa_1 \exp(-m)$,

Comparing this to the right hand side of the inequality of Theorem 1, we observe that not
knowing the optimal class incurs a penalty in the computational budget of roughly a factor of
$s$. This penalty is only logarithmic in the computational budget in most settings of interest.

(b) Algorithm 1 and Theorem 1, as stated, require a priori knowledge of the computational budget
$T$. We can address this using a standard doubling argument (see e.g. 0, Sec. 2.3]). Initially
we assume $T = 1$ and run Algorithm 1 accordingly. If we do not exhaust the budget, we
assume $T = 2$, and rerun Algorithm 1 for another round. If there is more computational time
at our disposal, we update our guess to $T = 4$ and so on. Suppose the real budget is $T_0$ with
$2^k - 1 \le T_0 < 2^{k+1} - 1$. After $i$ rounds of this doubling strategy, we have exhausted a budget
of $2^{i-1}$, with the last round getting a budget of $2^{i-2}$ for $i \ge 2$. In particular, the last round
with a net budget of $T_0$ is of length at least $T_0/4$. Since Theorem 1 applies to each individual
round, we obtain an oracle inequality where we replace $T_0$ with $T_0/4$; we can be agnostic to
the prior knowledge of the budget at the expense of slightly worse constants.

(c) For ease of presentation, Algorithm 1 and Theorem 1 use a specific setting of the coarse-grid
size, which corresponds to setting $\lambda = 1$ in Definition 1. In our proofs, we establish the theorem
for arbitrary $\lambda > 0$. As a consequence, to obtain slightly sharper bounds, we may optimize this
choice of $\lambda$; we do not pursue this here.
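The doubling argument of remark (b) can be simulated directly. The sketch below assumes the reading in which the guesses $1, 2, 4, \ldots$ are cumulative budgets, so that each new round only spends the difference to the previous guess; the function name is hypothetical.

```python
def doubling_rounds(true_budget):
    """Budgets of the rounds that complete before `true_budget` is exhausted,
    when the cumulative guess doubles after every round."""
    rounds, spent, guess = [], 0, 1
    while guess <= true_budget:
        rounds.append(guess - spent)  # extra budget needed to reach the guess
        spent = guess
        guess *= 2
    return rounds


print(doubling_rounds(10))
```

For a true budget of 10 the completed rounds have budgets 1, 1, 2 and 4, so the last completed round uses at least a quarter of the true budget, matching the $T_0/4$ claim.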

Now let us turn to a specialization of Theorem 1 to the settings outlined in Examples 1 and 2.
The following corollary shows oracle inequalities under the computational restrictions that are only
logarithmically worse than those possible in the computationally unconstrained model selection
procedure (3).

Corollary 1. Let $m > 0$ be a specified constant.

(a) In the setting of Example 1, define $n$ so that $nT/d$ is the number of samples that can be processed
by the inference algorithm $\mathcal{A}$ using $T$ units of computation. Assume that $T$ is large enough that
$nT \ge d/B$ and $nT \ge d/(4 r_1^2 X^2)$. With probability at least $1 - 2\kappa_1 \exp(-m)$, the output $f$ of
Algorithm 1 satisfies

$R(f) \le \inf_{i = 1, 2, 3, \ldots} \left\{ R_i^* + \sqrt{\frac{d \log_2(16 B nT/d)}{nT}} \left( 8 r_i X + \sqrt{8} \, \kappa_2 \sqrt{m + \log \log_2(16 B nT/d)} \right) \right\}$.

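(b) In the setting of Example 2, define $n$ so that $nT/d_i$ is the number of samples that can be
processed by the inference algorithm $\mathcal{A}$ using $T$ units of computation. Assume that $T$ is large
enough that $nT \ge 1$ and $nT \ge d_1/B$. With probability at least $1 - 2\kappa_1 \exp(-m)$, the output $f$
of Algorithm 1 satisfies

$R(f) \le \inf_{i = 1, 2, 3, \ldots} \left\{ R_i^* + \sqrt{\frac{d_i \log_2(16 B nT/d_1)}{nT}} \left( 4 \sqrt{d_i} + \sqrt{8} \, \kappa_2 \sqrt{m + \log \log_2(16 B nT/d_1)} \right) \right\}$.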

2.5 Proofs 



As remarked after Theorem 1, we will present our proofs for general settings of $\lambda > 0$. For the
proofs of Theorem 1 and Corollary 1 in this slight generalization, we define $S_\lambda$ as a set satisfying
the coarse grid condition with parameters $\lambda$, $m$ and $s(\lambda)$, with $s(\lambda)$ satisfying

$s(\lambda) \ge \frac{\log\left(1 + \frac{B}{\bar{\gamma}_1(T, s(\lambda))}\right)}{\log(1 + \lambda)} + 2$. (13)



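First, we show that this inequality is ensured by the choice given in Algorithm 1. To see this,
notice that

$\frac{\log\left(1 + \frac{B}{\bar{\gamma}_1(T, s(\lambda))}\right)}{\log(1 + \lambda)} + 2 \le \frac{\log\left(1 + \frac{B}{\bar{\gamma}_1(T, 1)}\right)}{\log(1 + \lambda)} + 2 \le \frac{\log(1 + B n_1(T))}{\log(1 + \lambda)} + 2$.

Here the first inequality uses that $\bar{\gamma}_1(T, s)$ does not decrease as $s$ grows, provided the penalty
$\gamma_1(n)$ is nonincreasing in $n$ and $n_1$ is nondecreasing in the budget; the second uses
$\bar{\gamma}_1(T, 1) \ge \gamma_1(n_1(T)) \ge 1/n_1(T)$, which follows from Assumption D(b).
Thus, for $\lambda = 1$, choosing $s(\lambda) = \lceil \log_2(1 + B n_1(T)) \rceil + 2$ suffices.
We require the additional notation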



$K(\lambda) := \max\{j : j \in S_\lambda\}$, (14)

where

$S_\lambda = \{j_1, \ldots, j_{s(\lambda)}\}$ (15)

is the natural generalization of the set $S$ defined in Algorithm 1: $j_{k+1}$ is chosen as the largest index $j$
for which $\gamma_j(T/s(\lambda)) \le (1 + \lambda)^k \gamma_1(T/s(\lambda))$. We begin the proof of Theorem 1 by showing that any
$s(\lambda)$ satisfying (13) ensures that any class $j > K(\lambda)$ must have penalty too large to be optimal,
so we can focus on classes $j \le K(\lambda)$. We then show that the output $f$ of Algorithm 1 satisfies
an oracle inequality for each class in $S_\lambda$, which is possible by an adaptation of arguments in prior
work [11]. Using the definition of our coarse grid set (Definition 1), we can then infer an oracle
inequality that applies to each class $j \le K(\lambda)$, and our earlier reduction to a finite model hierarchy
completes the argument.



2.5.1 Proof of Theorem 1

First we show that the selection of the set $S_\lambda$ satisfies Definition 1.

Lemma 1. Let $\{\gamma_i\}$ be a sequence of increasing positive numbers and for each $k \in \{0, \ldots, s - 1\}$
set $j_{k+1}$ to be the largest index $j$ such that $\gamma_j \le (1 + \lambda)^k \gamma_1$. Then for each $i \in \mathbb{N}$ such that $i \le j_s$,
there exists a $j \in \{j_1, \ldots, j_s\}$ such that $\gamma_i \le \gamma_j \le (1 + \lambda) \gamma_i$.

Proof. Let $i \le j_s$ and choose the smallest $j \in \{j_1, j_2, \ldots, j_s\}$ such that $\gamma_i \le \gamma_j$. Assume for the sake
of contradiction that $(1 + \lambda) \gamma_i < \gamma_j$. There exists some $k' \in \{0, \ldots, s - 1\}$ such that $\gamma_j \le (1 + \lambda)^{k'} \gamma_1$
and $\gamma_j > (1 + \lambda)^{k' - 1} \gamma_1$, and thus we obtain

$\gamma_i < \frac{\gamma_j}{1 + \lambda} \le (1 + \lambda)^{k' - 1} \gamma_1$. (16)

Let $j'$ be the largest element smaller than $j$ in the collection $\{j_1, j_2, \ldots, j_s\}$. Then by our
construction, $j'$ is the largest index satisfying $\gamma_{j'} \le (1 + \lambda)^{k' - 1} \gamma_1$. In particular, combining with our earlier
inequality (16) leads to the conclusion that $i \le j'$, which contradicts the fact that $j$ is the smallest
index in $\{j_1, \ldots, j_s\}$ satisfying $\gamma_i \le \gamma_j$. □
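Lemma 1 can be sanity-checked numerically on a concrete penalty sequence. The sketch below builds the grid indices as in the lemma and verifies the sandwich property for every $i \le j_s$; the sequence $\gamma_i = 0.1\sqrt{i}$ and the function names are hypothetical choices made for illustration.

```python
import math


def grid_indices(gammas, lam, s):
    """j_{k+1} = largest 1-based index j with gammas[j-1] <= (1+lam)^k * gammas[0]."""
    grid = []
    for k in range(s):
        bound = (1 + lam) ** k * gammas[0]
        grid.append(max(j for j, g in enumerate(gammas, start=1) if g <= bound))
    return grid


def lemma1_holds(gammas, lam, s):
    """Check that every i <= j_s has a grid index j with
    gamma_i <= gamma_j <= (1 + lam) * gamma_i."""
    grid = grid_indices(gammas, lam, s)
    for i in range(1, grid[-1] + 1):
        g_i = gammas[i - 1]
        if not any(g_i <= gammas[j - 1] <= (1 + lam) * g_i for j in grid):
            return False
    return True


gammas = [0.1 * math.sqrt(i) for i in range(1, 201)]  # increasing, hypothetical
print(grid_indices(gammas, lam=1, s=5), lemma1_holds(gammas, lam=1, s=5))
```

For this sequence the grid is 1, 4, 16, 64 and 200: the indices spread out as the penalties flatten, which is the behaviour depicted in Figure 1.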

Next, we show that, for $s(\lambda)$ satisfying (13), once the complexity penalty of a class becomes
too large, it can never be the minimizer of the penalized risk in the oracle inequality (11). See
Appendix [X] for the proof.

Lemma 2. Fix $\lambda > 0$ and $m > 0$, recall the definition (14) of $K(\lambda)$, and let $i^*$ be a class that
attains the minimum in the right side of the bound (11). We have $i^* \le K(\lambda)$.

Equipped with the lemmas, we can restrict our attention only to classes $i \in S_\lambda$. To that end, the
next result establishes an oracle inequality for our algorithm compared to all the classes in this set.

Proposition 1. Let $f = f_{\hat{i}}$ be the function chosen from the class $\hat{i}$ selected by the procedure (10),
where $S = S_\lambda$ and $s = s(\lambda)$. Under the conditions of Theorem 1, with probability at least
$1 - 2\kappa_1 \exp(-m)$,

$R(f) \le \min_{i \in S_\lambda} \left\{ R_i^* + 2 \gamma_i\left(\frac{T}{s(\lambda)}\right) + \kappa_2 \sqrt{\frac{2(m + \log s(\lambda))}{n_i(T/s(\lambda))}} \right\}$.

The proof of the proposition follows from an argument similar to that given in prior work, though we
must carefully reason about the different number of independent samples used to estimate within
each class $\mathcal{F}_i$. We present a proof in Appendix A. We can now complete the proof of Theorem 1
using the proposition.

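Proof of Theorem 1. Let $i$ be any class (not necessarily in $S_\lambda$) and $j \in S_\lambda$ be the smallest
class satisfying $j \ge i$. Then, by construction of $S_\lambda$, we know from Lemma 1 that

$2 \gamma_i\left(\frac{T}{s(\lambda)}\right) + \kappa_2 \sqrt{\frac{2(m + \log s(\lambda))}{n_i(T/s(\lambda))}} \le 2 \gamma_j\left(\frac{T}{s(\lambda)}\right) + \kappa_2 \sqrt{\frac{2(m + \log s(\lambda))}{n_j(T/s(\lambda))}} \le (1 + \lambda) \left[ 2 \gamma_i\left(\frac{T}{s(\lambda)}\right) + \kappa_2 \sqrt{\frac{2(m + \log s(\lambda))}{n_i(T/s(\lambda))}} \right]$.

In particular, we can lower bound the penalized risk of class $i$ as

$R_i^* + (1 + \lambda) \left[ 2 \gamma_i\left(\frac{T}{s(\lambda)}\right) + \kappa_2 \sqrt{\frac{2(m + \log s(\lambda))}{n_i(T/s(\lambda))}} \right] \ge R_j^* + 2 \gamma_j\left(\frac{T}{s(\lambda)}\right) + \kappa_2 \sqrt{\frac{2(m + \log s(\lambda))}{n_j(T/s(\lambda))}}$,

where we used the inclusion assumption A to conclude that $R_j^* \le R_i^*$. Now applying Proposition 1,
the above lower bound, and Lemma 2 in turn, we see that with probability at least $1 - 2\kappa_1 \exp(-m)$

$R(f) \le \min_{j \in S_\lambda} \left\{ R_j^* + \bar{\gamma}_j(T, s(\lambda)) \right\} \le \min_{i = 1, 2, \ldots, K(\lambda)} \left\{ R_i^* + (1 + \lambda) \bar{\gamma}_i(T, s(\lambda)) \right\} = \inf_{i = 1, 2, 3, \ldots} \left\{ R_i^* + (1 + \lambda) \bar{\gamma}_i(T, s(\lambda)) \right\}$,

where $\bar{\gamma}_i(T, s) = 2 \gamma_i(T/s) + \kappa_2 \sqrt{2(m + \log s)/n_i(T/s)}$ as in (8).
For $\lambda = 1$ (which we have seen satisfies (13)), this is the desired statement of the theorem. □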

2.5.2 Proof of Corollary 1

Under the conditions of Example 1 and the assumption that $nT \ge d/(4 r_1^2 X^2)$, Assumptions A-D
are satisfied with $n_i(T) = nT/d$ and $\gamma_i(T) = 2 r_i X \sqrt{d/(nT)}$. (In particular, $nT \ge d/(4 r_1^2 X^2)$
implies that $\gamma_1$ satisfies Assumption D(b).) Also, since $nT \ge d/B$, we have

$\left\lceil \log_2\left(1 + \frac{B nT}{d}\right) \right\rceil + 2 \le \log_2(2 B nT/d) + 3 = \log_2(16 B nT/d)$.

Substituting into Theorem 1 gives the first part of the corollary.

Similarly, under the conditions of Example 2 and the assumption that $nT \ge 1$, Assumptions A-D
are satisfied with $n_i(T) = nT/d_i$ and $\gamma_i(T) = d_i/\sqrt{nT}$. (In particular, $nT \ge 1$ implies that
$\gamma_1$ satisfies Assumption D(b).) Also, since $nT \ge d_1/B$, we have $s \le \log_2(16 B nT/d_1)$ as before.
Substituting into Theorem 1 gives the second part of the corollary.
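The bound on the grid size used in both parts of the proof can be checked numerically. The sketch below compares $\lceil \log_2(1 + B n_1) \rceil + 2$ with $\log_2(16 B n_1)$ for a few hypothetical values with $B n_1 \ge 1$; the function names are illustrative.

```python
import math


def grid_size(B, n1):
    """s = ceil(log2(1 + B * n1)) + 2, the coarse-grid size of Algorithm 1."""
    return math.ceil(math.log2(1 + B * n1)) + 2


def grid_size_bound(B, n1):
    """Upper bound log2(16 * B * n1), valid when B * n1 >= 1."""
    return math.log2(16 * B * n1)


checks = [(grid_size(1.0, n1), grid_size_bound(1.0, n1))
          for n1 in (1, 2, 10, 1_000, 1_000_000)]
print(checks)
```

Even with a million samples for the first class the grid has only 22 elements, which illustrates the $s = O(\log T)$ statement of Theorem 1.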



3 Fast rates for model selection 

Looking at the result given by Theorem 1, we observe that irrespective of the dependence of the
penalties $\gamma_i$ on the sample size, there are terms in the oracle inequality that always decay as
$O(1/\sqrt{n_i(T/s(\lambda))})$. A similar phenomenon is noted for classical model selection results in
computationally unconstrained settings; under conditions similar to Assumption C, this inverse-root
dependence on the number of samples is the best possible, due to lower bounds on the fluctuations
of the empirical process (e.g. [10, Theorem 2.3]). On the other hand, under suitable low noise
conditions or curvature properties of the risk functional @ , [H, Q, it is possible to obtain
estimation guarantees of the form

$R(f) = R(f^*) + O_P\left(\frac{1}{n}\right)$,

where $f$ (approximately) minimizes the $n$-sample empirical risk. Under suitable assumptions,
complexity regularization can also achieve fast rates for model selection [SL, 17]. In this section, we show
that similar results can be obtained in computationally constrained inferential settings.

3.1 Assumptions and example 

We begin by modifying our concentration assumption and providing a motivating example. 

Assumption E. For each $i$, let $f_i^* \in \operatorname{argmin}_{f \in \mathcal{F}_i} R(f)$. Then there are constants $\kappa_1, \kappa_2 > 0$ such that for any budget $T$ and the corresponding sample size $n_i(T)$,

$$ \mathbb{P}\Big( \sup_{f \in \mathcal{F}_i} \big( R(f) - R(f_i^*) - 2(\widehat{R}_{n_i(T)}(f) - \widehat{R}_{n_i(T)}(f_i^*)) \big) > \gamma_i(T) + \kappa_2 \epsilon \Big) \le \kappa_1 \exp(-n_i(T)\epsilon), \tag{17a} $$

$$ \mathbb{P}\Big( \sup_{f \in \mathcal{F}_i} \big( \widehat{R}_{n_i(T)}(f) - \widehat{R}_{n_i(T)}(f_i^*) - 2(R(f) - R(f_i^*)) \big) > \gamma_i(T) + \kappa_2 \epsilon \Big) \le \kappa_1 \exp(-n_i(T)\epsilon). \tag{17b} $$

Contrasting this with our earlier Assumption C, we see that the probability bounds (17a) and (17b) decay exponentially in $\epsilon$ rather than $\epsilon^2$, which leads to faster sub-exponential rates for estimation procedures. Concentration inequalities of this form are now well known [?, 18, 3], and the paper [?] uses an identical assumption.
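To see what the change from $\epsilon^2$ to $\epsilon$ buys, one can invert each tail bound at a fixed confidence level $\delta$. The sketch below assumes, purely for illustration, tails of the form $\kappa_1 \exp(-n\epsilon^2)$ and $\kappa_1 \exp(-n\epsilon)$; the function names and the constant in the exponent are hypothetical choices.

```python
import math

def slow_deviation(n, delta, kappa1=1.0):
    # Solve kappa1 * exp(-n * eps**2) = delta for eps: eps = sqrt(log(kappa1/delta) / n)
    return math.sqrt(math.log(kappa1 / delta) / n)

def fast_deviation(n, delta, kappa1=1.0):
    # Solve kappa1 * exp(-n * eps) = delta for eps: eps = log(kappa1/delta) / n
    return math.log(kappa1 / delta) / n

n, delta = 10_000, 0.01
slow = slow_deviation(n, delta)   # decays like 1/sqrt(n)
fast = fast_deviation(n, delta)   # decays like 1/n
```

Under these assumed forms the fast deviation is exactly the square of the slow one, which is the usual gap between $1/\sqrt{n}$ and $1/n$ rates.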

Before continuing, we give an example to illustrate the assumption. 

Example 3 (Fast rates for classification). We consider the function class hierarchy based on increasing dimensions of Example 2. We assume that the risk $R(f_\theta) = \mathbb{E}[\ell(y, f_\theta(x))]$ and that the loss function $\ell$ is either the squared loss $\ell(y, f_\theta(x)) = (y - f_\theta(x))^2$ or the exponential loss from boosting $\ell(y, f_\theta(x)) = \exp(-y f_\theta(x))$. Each of these examples satisfies Assumption E with

$$ \gamma_i(T) = \frac{c\, d_i \log(n_i(T)/d_i)}{n_i(T)} \tag{18} $$

for a universal constant $c$. This follows from Theorem 3 of [?] (which in turn follows from Theorem 3.3 in [6] combined with an argument based on Dudley's entropy integral [15]). The other parameter settings and computational considerations are identical to those of Example 2.

If we define $\widehat{f}_i = \mathcal{A}(i, T)$, then using Assumption B(d) (that $\widehat{R}_{n_i(T)}(\widehat{f}_i) - \widehat{R}_{n_i(T)}(f_i^*) \le \gamma_i(T)$) in conjunction with the bound (17a), we can conclude that for any time budget $T$, with probability at least $1 - \kappa_1 \exp(-m)$,

$$ R(\widehat{f}_i) \le R(f_i^*) + 3\gamma_i(T) + \frac{\kappa_2 m}{n_i(T)}. \tag{19} $$
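For completeness, the step from (17a) to (19) can be written out as follows. This is a sketch of the standard argument, with the choice $\epsilon = m/n_i(T)$ made explicit; it uses only Assumption B(d) and the bound (17a) as stated above.

```latex
% Set \epsilon = m / n_i(T) in (17a). With probability at least
% 1 - \kappa_1 \exp(-m), every f \in \mathcal{F}_i satisfies
\begin{align*}
R(f) - R(f_i^*)
  &\le 2\bigl(\widehat{R}_{n_i(T)}(f) - \widehat{R}_{n_i(T)}(f_i^*)\bigr)
     + \gamma_i(T) + \frac{\kappa_2 m}{n_i(T)}.
\intertext{Taking $f = \widehat{f}_i$ and using
$\widehat{R}_{n_i(T)}(\widehat{f}_i) - \widehat{R}_{n_i(T)}(f_i^*) \le \gamma_i(T)$
(Assumption B(d)) gives}
R(\widehat{f}_i) - R(f_i^*)
  &\le 2\gamma_i(T) + \gamma_i(T) + \frac{\kappa_2 m}{n_i(T)}
   = 3\gamma_i(T) + \frac{\kappa_2 m}{n_i(T)}.
\end{align*}
```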



One might thus expect that by following arguments similar to those in [8], it would be possible to show fast rates for model selection based on Algorithm 1. Unfortunately, the results of [8] heavily rely on the fact that the data used for computing the estimators $\widehat{f}_i$ is the same for each class $i$, so that the fluctuations of the empirical processes corresponding to the different classes are positively correlated. In our computationally constrained setting, however, each class's estimator is computed on a different sample. It is thus more difficult to relate the estimators than in previous work, necessitating a modification of our earlier Algorithm 1 and a new analysis, which follows.

3.2 Algorithm and oracle inequality 

As in Section 2, our approach is based on performing model selection over a coarsened version of the collection $\mathcal{F}_1, \mathcal{F}_2, \ldots$. To construct the coarser collection of indices, we define the composite penalty term (based on Assumption E)

-frr \ on f T \ i K 2 m + 2logs 

Based on the above penalty term, we define our analogue of the coarse grid set (9).

We give our modified model selection procedure in Algorithm 2. In the algorithm and in our subsequent analysis, we use the shorthand $\widehat{R}_i(f)$ to denote the empirical risk of the function $f$ on the $n_i(T)$ samples associated with class $i$. Our main oracle inequality is the following:

Theorem 2. Let f = A(i, T/s(\)) be the output of the algorithm A for class i specified by the pro- 
cedure \21}) . Let Assumptions^^ a^^E be satisfied. With probability at least 1 — 2k,\ exp(— m) 

£ JsL. {* + 40 ^ (?) + 10 ^'^m} ■ <22> 
Algorithm 2 Computationally budgeted model selection over hierarchies with fast concentration
Input: Model hierarchy $\{\mathcal{F}_i\}$ with corresponding penalty functions $\gamma_i$, computational budget $T$, upper bound $B$ on the minimum risk of class 1, and confidence parameter $m > 0$.
Construction of the coarse-grid set $S$:
  Set $s = \lceil \log_2(1 + B n_1(T)) \rceil + 2$.
  for $k = 0$ to $s - 1$ do
    Set $j_{k+1}$ to be the largest class for which $\gamma_j(T/s) \le 2^k \gamma_1(T/s)$.
  end for
  Set $S = \{j_k : k = 1, \ldots, s\}$.
Model selection estimate:
  Set $\widehat{f}_i = \mathcal{A}(i, T/s)$ for $i \in S$.
  Select the class $\hat{i} \in S$ to be the largest class that satisfies the condition (21) for all $j \in S$ such that $j < \hat{i}$.
  Output the function $\mathcal{A}(\hat{i}, T/s)$.



Furthermore, if $n_1(T) = \mathcal{O}(T)$ then $s = \mathcal{O}(\log T)$.

By inspection of the bound (19), which is achieved by devoting the full computational budget $T$ to the optimal class, we see that the oracle inequality of Theorem 2 has dependence on the computational budget within logarithmic factors of the best possible.
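The coarse-grid construction in Algorithm 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the hierarchy is truncated to a finite `num_classes` (hypothetical), `gamma(j, t)` and `n1(t)` are user-supplied callables (hypothetical names), and the selection step is omitted because it depends on the condition (21).

```python
import math

def coarse_grid(gamma, n1, T, B, num_classes):
    """Return (s, S) for the coarse grid of Algorithm 2.

    gamma(j, t): penalty of class j under budget t (assumed nondecreasing in j)
    n1(t): number of samples class 1 can process under budget t
    num_classes: finite truncation of the hierarchy (an assumption of this sketch)
    """
    s = math.ceil(math.log2(1 + B * n1(T))) + 2
    base = gamma(1, T / s)
    grid = []
    for k in range(s):
        threshold = 2 ** k * base
        # Largest class whose penalty is within a factor 2^k of class 1's penalty.
        eligible = [j for j in range(1, num_classes + 1) if gamma(j, T / s) <= threshold]
        grid.append(max(eligible))
    return s, sorted(set(grid))

# Toy penalty gamma_j(t) = j / sqrt(t), with one sample per unit of budget.
s, S = coarse_grid(lambda j, t: j / math.sqrt(t), lambda t: t, T=1000, B=1.0, num_classes=100)
```

With this toy penalty the grid picks out classes at geometrically spaced penalty levels, so only about $\log_2 T$ of the classes ever receive computation.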

The following corollary shows the application of Theorem 2 to the classification problem we discuss in Example 3.

Corollary 2. In the setting of Example 3, define $n$ so that $nT/d_i$ is the number of samples that can be processed by the inference algorithm $\mathcal{A}$ using $T$ units of computation. Assume that $nT \ge e d_1$, $nT \ge d_1/B$, and choose the constant $c$ in the definition (18) of $\gamma_i(T)$ such that $c \ge 1/d_1$. With probability at least $1 - 4\kappa_1 \exp(-m)$, the output $\widehat{f}$ of Algorithm 2 satisfies

nT \ 
d?log 2 (16BnT/di)J 

+ k 2 (m + loglog 2 (16BnT/di)) 

3.3 Proofs of main results 

In this section, we provide proofs of Theorem 2 and Corollary 2. Like our previous proof for Theorem 1, we again provide the proof of Theorem 2 for general settings of $\lambda > 0$. The proof of Theorem 2 broadly follows that of Theorem 1, in that we establish an analogue of Proposition 1, which provides an oracle inequality for each class in the coarse-grid set $S_\lambda$. We then extend the proven inequality to apply to each function class $\mathcal{F}_i$ in the hierarchy using the definition (15) of the grid set.



R(f) < ( W_ U + (4c* log ( 
Proof of Theorem 2. Let $n_i$ be shorthand for $n_i(T/s(\lambda))$, the number of samples available to class $i$, and let $\widehat{R}_i(f)$ denote the empirical risk of the function $f$ using the $n_i$ samples for class $i$. In addition, let $\gamma_i(n_i)$ be shorthand for $\gamma_i(n_i(T/s(\lambda)))$, the penalty value for class $i$ using $n_i(T/s(\lambda))$ samples. With these definitions, we adopt the following shorthand for the events in the probability bounds (17a) and (17b). Let $\epsilon = \{\epsilon_i\}$ be an $s(\lambda)$-dimensional vector with (arbitrary for now) positive entries. For each pair of indices $i$ and $j$ define

£?'(e0 := | sup (i?(/) - R(f*) - 2 (£(/) - £(//))) < 7j K) + K 2 e^ (23a) 

:= { sup (£(/) - %/*) - 2 (#(/) - i?(/*))) < 7 ;(n 4 ) + KfcA (23b) 

and define the joint events 

£ 1 {e):= |J |J and £ 2 (e) := |J |J (24) 



With the "good" events (24) defined, we turn to the two technical lemmas, which relate the risk of the chosen function $\widehat{f}$ to $f_i^*$ for each $i \in S_\lambda$. We provide proofs of both lemmas in Appendix B. To make the proofs of each of the lemmas cleaner and see the appropriate choices of constants, we replace the selection strategy (21) with one whose constants have not been specified. Specifically, we select $\hat{i}$ as the largest class that satisfies

Hh) + °nr (^y ) + c ^ < %/,) + c l7j (J^j (25) 

for j G S with j < i. 

Lemma 3. Let the events (23a) and (23b) hold for all $i, j \in S_\lambda$, that is, $\mathcal{E}_1(\epsilon)$ and $\mathcal{E}_2(\epsilon)$ hold. Then using the selection strategy (25), for each $j < \hat{i}$ with $j \in S_\lambda$ we have



R(f-) < R(f*) + \ 



y - cij 7?(™?) + (6 + c^jjirij) + 2K 2 ej + Q - c 2 ^ /c 2 ej- 



We require a different argument for the case that $j > \hat{i}$, and the constants are somewhat worse.

Lemma 4. Let the events (23a) and (23b) hold for all $i, j \in S_\lambda$, that is, $\mathcal{E}_1(\epsilon)$ and $\mathcal{E}_2(\epsilon)$ hold. Assume also that $c_1 \ge 17/2$ and $c_2 \ge 7/2$. Then using the selection strategy (25), for each $j > \hat{i}$ with $j \in S_\lambda$ we have

R(fy < R(f*) + s(X) [(2 C1 + S)jj(rij) + (2c 2 + l) £j ] . 

We use Lemmas 3 and 4 to complete the proof of the theorem. When Assumption E holds, the probability that one of the events $\mathcal{E}_1(\epsilon)$ and $\mathcal{E}_2(\epsilon)$ fails to hold is upper bounded by

i(e) c U£ 2 (e) c )< Yl n4 j ^T)+ E n4 j ^i) C )<^i E exp^r^T/s^K) 
by a union bound. Thus, we see that if we define the constants



2- 



m + log(s(A)) 
ni(T/s(X)) ' 



we obtain that all of the events £ 1 (e^) and £ 2 {(-i) hold with probability at least 1 — 2k\ exp(— m). 
Applying Lemmas [3] and H] with the choices c\ = and C2 = | , we obtain that with probability at 
least 1 — 2k± exp( 



-m) 



R(f 7 )< min {R(f*)+s(\) 



20 7i K) + 10 



m + log(s(A)) 
ni(T/s(X)) 



(26) 



The inequality (26) is the analogue of Proposition 1 in the current setting. Given the inequality, the remainder of the proof of Theorem 2 follows the same recipe as that of Theorem 1. Recalling the notation defining $K(\lambda)$, we apply the inequality (26) with the definition of the grid set (15) to obtain an oracle inequality compared to all classes $i \le K(\lambda)$. Then provided that



S (A)> 



log 1 + 



B 

a (A)7 1 (T, S (A)) 



log(l + A) 



+ 2, 



we can transfer the result to the entire model hierarchy as before. For $\lambda = 1$, the choice of $s$ employed in Algorithm 2 again suffices for this. □



Proof of Corollary 2. In the setting of Example 3, we set $n_i(T) = nT/d_i$ and

$$ \gamma_i(T) = \frac{c\, d_i \log(n_i(T)/d_i)}{n_i(T)} = \frac{c\, d_i^2 \log(nT/d_i^2)}{nT}. $$

It is straightforward to verify that the conditions of the corollary ensure that Assumptions A, B, D and E are satisfied. In particular, $nT \ge e d_1$ and $c \ge 1/d_1$ ensure that $\gamma_1(T) \ge 1/n_1(T)$. Also, $nT \ge d_1/B$ ensures that $s \le \log_2(16BnT/d_1)$. Substituting $\gamma_i$, $n_i$ and $s$ into Theorem 2 gives the result. □



4 Oracle inequalities for unstructured models 

To this point, our results have addressed the model selection problem in scenarios where we have 
a nested collection of models. In the most general case, however, the collection of models may be 
quite heterogeneous, with no relationship between the different model families. In classification, 
for instance, we may consider generalized linear models with different link functions, decision trees, 
random forests, or other families among our collection of models. For a non-parametric regression 
problem, we may want to select across a collection of dictionaries such as wavelets, splines, and 
polynomials. While this more general setting is obviously more challenging than the structured cases 
in the prequel, we would like to study the effects that limiting computation has on model selection 
problems, understanding when it is possible to outperform computation-agnostic strategies. 
4.1 Problem setting and algorithm

When no structure relates the models under consideration, it is impossible to work with an infinite 
collection of classes within a finite computational time — any estimator must evaluate each class 
(that is, at least one sample must be allocated to each class, as any class could be significantly 
better than the others). As a result, we restrict ourselves to finite model collections in this section, 
so that we have a sequence $\mathcal{F}_1, \ldots, \mathcal{F}_K$ of models from which we wish to select. Our approach to the
unstructured case is to incrementally allocate computational quota amongst the function classes, 
where we trade off receiving samples for classes that have good risk performance against exploring 
classes for which we have received few data points. More formally, with T available quanta of 
computation, it is natural to view the model selection problem as a T round game, where in each 
round a procedure selects a function class i and allocates it one additional quantum of computation. 

With this setup, we turn to stating a few natural assumptions. We assume that the computational complexity of fitting a model grows linearly and incrementally with the number of samples, which means that allocating an additional quantum of training time allows the learning algorithm $\mathcal{A}$ to process an additional $n_i$ samples for class $\mathcal{F}_i$. In the context of Sections 2 and 3, this means that we assume $n_i(t) = t n_i$ for some fixed number $n_i$ specific to class $i$. This linear growth assumption is satisfied, for instance, when the loss function $\ell$ is convex and the black-box learning algorithm $\mathcal{A}$ is a stochastic or online convex optimization procedure [13, 12]. We also require assumptions similar to Assumptions B and C.

Assumption F. Let $\mathcal{A}(i, T) \in \mathcal{F}_i$ denote the output of algorithm $\mathcal{A}$ when executed for class $\mathcal{F}_i$ with a computational budget $T$.

(a) For each $i$, there exists an $n_i \in \mathbb{N}$ such that in $T$ units of time, algorithm $\mathcal{A}$ can compute $\mathcal{A}(i, T)$ using $n_i T$ samples.

(b) For each $i \in [K]$, there is a function $\gamma_i$ and constants $\kappa_1, \kappa_2 > 0$ such that for any $T \in \mathbb{N}$,

$$ \mathbb{P}\big( |\widehat{R}_{n_i T}(\mathcal{A}(i, T)) - R(\mathcal{A}(i, T))| > \gamma_i(n_i T) + \kappa_2 \epsilon \big) \le \kappa_1 \exp(-4 n_i T \epsilon^2). \tag{27} $$

(c) The output $\mathcal{A}(i, T)$ is a $\gamma_i(n_i T)$-minimizer of $\widehat{R}_{n_i T}$, that is,

$$ \widehat{R}_{n_i T}(\mathcal{A}(i, T)) - \inf_{f \in \mathcal{F}_i} \widehat{R}_{n_i T}(f) \le \gamma_i(n_i T). $$

(d) For each $i$, the function $\gamma_i$ satisfies $\gamma_i(n) \le c_i n^{-\alpha_i}$ for some $\alpha_i > 0$.

(e) For any fixed function $f \in \mathcal{F}_i$, $\mathbb{P}(|\widehat{R}_n(f) - R(f)| > \kappa_2 \epsilon) \le \kappa_1 \exp(-4n\epsilon^2)$.

Comparing to Assumptions B and C, we see that the main difference is in the linear time assumption (a) and growth assumption (d). In addition, the complexity penalties and function classes discussed in our earlier examples satisfy Assumption F.

We now present our algorithm for successively allocating computational quanta to the function classes. To choose the class $i$ receiving computation at iteration $t$, the procedure must balance competing goals of exploration, evaluating each function class $\mathcal{F}_i$ adequately, and exploitation, giving more computation to classes with low empirical risk. To promote exploration, we use an optimistic selection criterion to choose class $i$, which (assuming that $\mathcal{F}_i$ has seen $n$ samples at this point) is

$$ \bar{R}(i, n) = \widehat{R}_n(\mathcal{A}(i, n)) - \gamma_i(n) - \kappa_2\sqrt{\frac{\log K}{n}} + \gamma_i(T n_i). \tag{28} $$

The intuition behind the definition of $\bar{R}(i, n)$ is that we would like the algorithm to choose functions $f$ and classes $i$ that minimize $\widehat{R}_n(f) + \gamma_i(T n_i) \approx R(f) + \gamma_i(T n_i)$, but the negative $\gamma_i(n)$ and $\sqrt{\log K / n}$ terms lower the criterion significantly when $n$ is small and thus encourage initial exploration. The criterion (28) essentially combines a penalized model-selection objective with an optimistic criterion similar to those used in multi-armed bandit algorithms [?]. Algorithm 3 contains the formal description of our bandit procedure for model selection. Algorithm 3 begins by receiving $n_i$ samples for each of the $K$ classes $\mathcal{F}_i$ to form the preliminary empirical estimates (28); we then use the optimistic selection criterion until the computational budget is exhausted.

Algorithm 3 Multi-armed bandit algorithm for selection of best class $\hat{i}$.
  For each $i \in [K]$, query $n_i$ examples from class $\mathcal{F}_i$.
  for $t = K + 1$ to $T$ do
    Let $n_i(t)$ be the number of examples seen for class $i$ until time $t$.
    Let $i_t = \operatorname{argmin}_{i \in [K]} \bar{R}(i, n_i(t)) - \kappa_2\sqrt{\log T / n_i(t)}$.
    Query $n_{i_t}$ examples for class $i_t$.
  end for
  Output $\hat{i}$, the index of the most frequently queried class.
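The allocation loop of Algorithm 3 can be sketched as follows. This is an illustrative sketch only: `empirical_risk` is a hypothetical callback standing in for running $\mathcal{A}$ and evaluating its empirical risk, the multiplier `kappa2` on both square-root terms is an assumption, and the toy usage feeds in noise-free risks so that the run is deterministic.

```python
import math

def bandit_select(empirical_risk, gamma, n, T, kappa2=1.0):
    """Sketch of the optimistic allocation in Algorithm 3.

    empirical_risk(i, m): empirical risk of the fitted model for class i on m samples
    gamma(i, m): complexity penalty of class i at m samples
    n[i]: samples processed per quantum of computation for class i
    """
    K = len(n)
    pulls = [1] * K  # initialization: one quantum for every class

    def index(i):
        m = n[i] * pulls[i]  # samples seen so far by class i
        # Optimistic criterion (28), with kappa2 on the sqrt(log K / m) term (assumed).
        r_bar = (empirical_risk(i, m) - gamma(i, m)
                 - kappa2 * math.sqrt(math.log(K) / m)
                 + gamma(i, T * n[i]))
        # Additional exploration bonus used in the selection step.
        return r_bar - kappa2 * math.sqrt(math.log(T) / m)

    for _ in range(K + 1, T + 1):
        i_t = min(range(K), key=index)
        pulls[i_t] += 1
    best = max(range(K), key=lambda i: pulls[i])
    return best, pulls

# Toy usage with hypothetical noise-free risks; class 1 has the lowest risk.
risks = [0.9, 0.2, 0.7]
best, pulls = bandit_select(lambda i, m: risks[i], lambda i, m: 1.0 / math.sqrt(m), [1, 1, 1], T=300)
```

In this toy run the class with the smallest penalized risk receives the bulk of the budget, while the others are sampled only often enough to rule them out.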



4.2 Main results and some consequences 

The goal of the selection procedure is to find the best penalized class $i^*$: a class satisfying

$$ i^* \in \operatorname{argmin}_{i \in [K]} \Big\{ \inf_{f \in \mathcal{F}_i} R(f) + \gamma_i(T n_i) \Big\} = \operatorname{argmin}_{i \in [K]} \big\{ R_i^* + \gamma_i(T n_i) \big\}. $$

To present our main results for Algorithm 3, we define the excess penalized risk $\Delta_i$ of class $i$:

$$ \Delta_i := R_i^* + \gamma_i(T n_i) - R_{i^*}^* - \gamma_{i^*}(T n_{i^*}) \ge 0. \tag{29} $$

Without loss of generality we assume that the infimum in $R_i^* = \inf_{f \in \mathcal{F}_i} R(f)$ is attained by a function $f_i^*$ (if not, we use a limiting argument, choosing some fixed $f_i^*$ such that $R(f_i^*) \le \inf_{f \in \mathcal{F}_i} R(f) + \delta$ for an arbitrarily small $\delta > 0$).
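As a small worked example of the definition (29), the sketch below computes the best penalized class and the gaps $\Delta_i$ for hypothetical values of $R_i^*$, $n_i$ and $\gamma_i$; all numbers and names are illustrative assumptions.

```python
import math

def penalized_gaps(best_risks, gamma, n, T):
    """Return (i_star, gaps) where gaps[i] is the excess penalized risk of class i."""
    penalized = [r + gamma(i, T * n[i]) for i, r in enumerate(best_risks)]
    i_star = min(range(len(penalized)), key=penalized.__getitem__)
    gaps = [p - penalized[i_star] for p in penalized]
    return i_star, gaps

# Hypothetical hierarchy: richer classes have lower best risk but larger penalties.
best_risks = [0.30, 0.25, 0.20]
i_star, gaps = penalized_gaps(best_risks, lambda i, m: (i + 1) / math.sqrt(m), n=[1, 1, 1], T=100)
```

Here the richest class has the smallest risk $R_i^*$, yet at this budget the simplest class minimizes the penalized risk, so it plays the role of $i^*$.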

The gains of a computationally adaptive strategy over naive strategies are clearest when the gap (29) is non-zero for each $i$, though in the sequel, we forgo this requirement. Under this assumption, we can follow the ideas of Auer et al. [?] to show that the fraction of the computational budget allocated to any suboptimal class $i \ne i^*$ goes quickly to zero as $T$ grows. We provide the proof of the following theorem in Section 4.3.

Theorem 3. Let Alg. 3 be run for $T$ rounds, and let $T_i(t)$ be the number of times class $i$ is queried through round $t$. Let $\Delta_i$ be defined as in (29) and Assumption F hold, and assume that $T > K$. Define $\beta_i = \max\{1/\alpha_i, 2\}$. There is a constant $C$ such that



11: 




where $c_i$ and $\alpha_i$ are the constants in the definition F(d) of the concentration function $\gamma_i$.

At a high level, this result shows that the fraction of budget allocated to any suboptimal class 
goes to at the rate ^Kp ( T \ . Hence, asymptotically in T, the procedure performs almost as 
if all the computational budget were allocated to class $i^*$. To see an example of concrete rates that can be concluded from the above result, let $\mathcal{F}_1, \ldots, \mathcal{F}_K$ be model classes with finite VC-dimension,¹ so that Assumption F is satisfied with $\alpha_i = 1/2$. Then we have

Corollary 3. Under the conditions of Theorem 3, assume $\mathcal{F}_1, \ldots, \mathcal{F}_K$ are model classes of finite VC-dimension, where $\mathcal{F}_i$ has dimension $d_i$. Then there is a constant $C$ such that

^, rm/mX , ^maxjdj, Kn log T} , m / . m . ^maxjcL, k 2 , logT}\ ki 
E[Ti(T)] < C 1 2 B J and P Tj(T) > C 1 2 & J < 



A lower bound by Lai and Robbins [19] for the multi-armed bandit problem shows that Corollary 3 is nearly optimal in general. To see the connection, let $\mathcal{F}_i$ correspond to the $i$th arm in a multi-armed bandit problem and the risk $R_i^*$ be the expected reward of arm $i$, and assume w.l.o.g. that $R_i^* \in [0, 1]$. In this case, the complexity penalty $\gamma_i$ for each class is 0. Let $p_i$ be a distribution on $\{0, 1\}$, where $p_i(1) = R_i^*$ and $p_i(0) = 1 - R_i^*$ (let $p_i = p_i(1)$ for shorthand). Lai and Robbins [19] give a lower bound that shows that the expected number of pulls of any suboptimal arm is at least $\mathbb{E}[T_i(T)] = \Omega(\log T / \mathrm{KL}(p_i \| p_{i^*}))$, where $p_i$ and $p_{i^*}$ are the reward distributions for the $i$th and optimal arms, respectively. An asymptotic expansion shows that $\mathrm{KL}(p_i \| p_{i^*}) = \Delta_i^2 / (2 p_i (1 - p_i))$, plus higher order terms, in this case; Corollary 3 is essentially tight.
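The asymptotic expansion of the Bernoulli KL divergence can be checked numerically. The sketch below compares the exact divergence with the quadratic term $\Delta^2 / (2p(1-p))$ for a small gap; the function names and the values of `p` and `q` are illustrative.

```python
import math

def kl_bernoulli(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def quadratic_approx(p, q):
    # Leading term of the expansion around q = p: (p - q)^2 / (2 p (1 - p))
    return (p - q) ** 2 / (2 * p * (1 - p))

p, q = 0.5, 0.52
exact = kl_bernoulli(p, q)
approx = quadratic_approx(p, q)
```

For gaps this small the two quantities agree to within a fraction of a percent; the agreement degrades as the gap grows, which is where the higher order terms matter.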

The condition that the gap $\Delta_i > 0$ may not always be satisfied, or $\Delta_i$ may be so small as to render the bound in Theorem 3 vacuous. Nevertheless, it is intuitive that our algorithm can quickly find a small set of "good" classes (those with small penalized risk) and spend its computational budget to try to distinguish amongst them. In this case, Algorithm 3 does not visit suboptimal classes and so can output a function $\widehat{f}$ satisfying good oracle bounds. In order to prove a result quantifying this intuition, we first upper bound the regret of Algorithm 3, that is, the average excess risk suffered by the algorithm over all iterations, and then show how to use this bound for obtaining a model with a small risk. For the remainder of the section, we simplify the presentation by assuming that $\alpha_i \equiv \alpha$ and define $\beta = \max\{1/\alpha, 2\}$.

Proposition 2. Use the same assumptions as Theorem 3, but further assume that $\alpha_i \equiv \alpha$ for all $i$. With probability at least $1 - \kappa_1/(TK^3)$, the regret (average excess risk) of Algorithm 3 satisfies



i=l \ i=l / 



V/3 



for a constant C dependent on a. 

¹ Similar corollaries hold for any model class whose metric entropy grows polynomially in $\log(1/\epsilon)$.
Our final main result builds on Proposition 2 to show that when it is possible to average functions across classes $\mathcal{F}_i$, we can aggregate all the "played" functions $f_t$, one for each iteration $t$, to obtain a function with small risk. Indeed, setting $f_t = \mathcal{A}(i_t, n_{i_t}(t))$, we obtain the following theorem (whose proof, along with that of Proposition 2, we provide in Appendix D):

Theorem 4. Use the conditions of Proposition 2. Let the risk function $R$ be convex on $\mathcal{F}_1 \cup \cdots \cup \mathcal{F}_K$, and let $f_t$ be the function chosen by algorithm $\mathcal{A}$ at round $t$ of Alg. 3. Define the average function $\widehat{f}_T = \frac{1}{T}\sum_{t=1}^{T} f_t$. There are constants $C, C'$ (dependent on $\alpha$) such that with probability greater than $1 - 2\kappa_1/(TK^3)$,

V/3 



R{fr) < R* +n*{T ni ,) + 2eK 2 T-P^T 




K 



=i 



c i n i a + K 2 n i 2 \J\ogK + K 2 n i 2 \J\ogT 



Let us interpret the above bound and discuss its optimality. When a = ^ (e.g., for VC classes), 
we have /3 = 2; moreover, it is clear that ~ = ^(-^0- Thus, to within constant factors, 



R(f T ) = R*«+ ll *(Tn l *) + 



y K maxjlog T, log K} 



T 



Ignoring logarithmic factors, the above bound is minimax optimal, which follows by a reduction of our model selection problem to the special case of a multi-armed bandit problem. In this case, Theorem 5.1 of Auer et al. [?] shows that for any set of $K, T$ values, there is a distribution over the rewards of arms which forces $\Omega(\sqrt{KT})$ regret, that is, the average excess risk of the classes chosen by Alg. 3 must be $\Omega(\sqrt{KT})$, matching Proposition 2 and Theorem 4.

The scaling $\mathcal{O}(\sqrt{K})$ is essentially as bad as splitting the computational budget $T$ uniformly across each of the $K$ classes, which yields (roughly) an oracle inequality of the form

y / K log K s 
\JTni* 

Comparing this bound to Theorem 4, we see that the penalty $\gamma_{i^*}$ in the theorem is smaller. The other key distinction between the two bounds (ignoring logarithmic factors) is the difference between

— and . 



R(f)=R**+ li *(Tn i */K) + 



Hi Hi 



1=1 

When the left quantity is smaller than the right, the bandit-based Algorithm 3 and the extension indicated by Theorem 4 give improvements over the naive strategy of uniformly splitting the budget across classes. However, if each class has similar computational cost $n_i$, no strategy can outperform the naive one.

We also observe that we can apply the online procedure of Algorithm 3 to the nested setup of Sections 2 and 3 as well. In this case, by applying Algorithm 3 only to elements of the coarse-grid set $S_\lambda$, we can replace $K$ in the bounds of Theorems 3 and 4 with $s(\lambda)$, which gives results similar to our earlier Theorems 1 and 2. In particular, if we are in the setup of Theorem 3 with a large separation between penalized risks, then Algorithm 3 applied to the coarse-grid set is expected to outperform a uniform allocation of budget within the set as in Sections 2 and 3.
4.3 Proof of Theorem 3



At a high level, the proof of this theorem involves combining the techniques for analysis of multi-armed bandits developed by Auer et al. [?] with Assumption F. We start by giving a lemma that will be useful to prove the theorem. The lemma states that after a sufficient number of initial iterations $\tau$, the probability that Algorithm 3 chooses to receive samples for a sub-optimal function class $i \ne i^*$ is extremely small. Recall also our notational convention that $\beta_i = \max\{1/\alpha_i, 2\}$.

Lemma 5. Let Assumption\F\hold. For any class i, any Si S [1,7"] and Si* E [t,T] where r satisfies 



T > 



we have 



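$$ \mathbb{P}\left( \bar{R}(i, n_i s_i) - \kappa_2\sqrt{\frac{\log T}{n_i s_i}} \le \bar{R}(i^*, n_{i^*} s_{i^*}) - \kappa_2\sqrt{\frac{\log T}{n_{i^*} s_{i^*}}} \right) \le \frac{2\kappa_1}{(TK)^4}. $$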



We defer the proof of the lemma to Appendix C, though at a high level the proof works as follows. The "bad event" in Lemma 5, which corresponds to Algorithm 3 selecting a sub-optimal class $i \ne i^*$, occurs only if one of the following three errors occurs: the empirical risk of class $i$ is much lower than its true risk, the empirical risk of class $i^*$ is higher than its true risk, or $s_i$ is not large enough to actually separate the true penalized risks from one another. The assumptions of the lemma make each of these three sub-events quite unlikely. Now we turn to the proof of Theorem 3, assuming the lemma.

Let $i_t$ denote the model class index $i$ chosen by Algorithm 3 at time $t$, and let $s_i(t)$ denote the number of times class $i$ has been selected at round $t$ of the algorithm. When no time index is needed, $s_i$ will denote the same thing. Note that if $i_t = i$ and the number of times class $i$ is queried exceeds $\tau > 0$, then by the definition of the selection criterion (28) and choice of $i_t$ in Alg. 3, for some $s_i \in \{\tau, \ldots, t - 1\}$ and $s_{i^*} \in \{1, \ldots, t - 1\}$ we have



$$ \bar{R}(i, n_i s_i) - \kappa_2\sqrt{\frac{\log T}{n_i s_i}} \le \bar{R}(i^*, n_{i^*} s_{i^*}) - \kappa_2\sqrt{\frac{\log T}{n_{i^*} s_{i^*}}}. $$

Here we interpret $\bar{R}(i, n_i s_i)$ to mean a random realization of the observed risk consistent with the samples we observe. Using the above implication, we thus have

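$$ T_i(T) = 1 + \sum_{t=K+1}^{T} \mathbb{I}(i_t = i) \le \tau + \sum_{t=K+1}^{T} \mathbb{I}\big(i_t = i,\ T_i(t-1) \ge \tau\big) $$

$$ \le \tau + \sum_{t=K+1}^{T} \mathbb{I}\left( \min_{\tau \le s_i < t} \left[ \bar{R}(i, n_i s_i) - \kappa_2\sqrt{\frac{\log T}{n_i s_i}} \right] \le \max_{0 < s_{i^*} < t} \left[ \bar{R}(i^*, n_{i^*} s_{i^*}) - \kappa_2\sqrt{\frac{\log T}{n_{i^*} s_{i^*}}} \right] \right) $$

$$ \le \tau + \sum_{t=1}^{T} \sum_{s_{i^*}=1}^{t-1} \sum_{s_i=\tau}^{t-1} \mathbb{I}\left( \bar{R}(i, n_i s_i) - \kappa_2\sqrt{\frac{\log T}{n_i s_i}} \le \bar{R}(i^*, n_{i^*} s_{i^*}) - \kappa_2\sqrt{\frac{\log T}{n_{i^*} s_{i^*}}} \right). \tag{30} $$

To control the last term, we invoke Lemma 5 and obtain that

$$ \mathbb{E}[T_i(T)] \le \tau + \sum_{t=1}^{T} \sum_{s_{i^*}=1}^{t-1} \sum_{s_i=\tau}^{t-1} \frac{2\kappa_1}{(TK)^4} \le \tau + \frac{\kappa_1}{TK^4}. $$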
Hence for any suboptimal class $i \ne i^*$, $\mathbb{E}[T_i(T)] \le \tau_i + \kappa_1/(TK^4)$, where $\tau_i$ satisfies the lower bound of Lemma 5 and is thus logarithmic in $T$. Under the assumption that $T > K$, for $i \ne i^*$,



E[Ti(T)] < C 



(31) 



a maxfl/^,2} 



for a constant $C \le 2 \cdot 4^{\max\{1/\alpha_i, 2\}}$. Now we prove the high-probability bound. For this part, we need only concern ourselves with the sum of indicators from (30). Markov's inequality shows that



Thus we can assert that the bound (31) on $T_i(T)$ holds with high probability.

Remark: By examining the proof of Theorem 3, it is straightforward to see that if we modify the multipliers on the square root terms in the criterion (28) to be $m\kappa_2$ instead of $\kappa_2$, we get that the probability bound is of the order $T^{3-4m} K^{-4m}$, while the bound on $T_i(T)$ is scaled by $m^{1/\alpha_i}$.

5 Discussion 

In this paper, we have presented a new framework for model selection with computational con- 
straints. The novelty of our setting is the idea of using computation — rather than samples — as the 
quantity against which we measure the performance of our estimators. As our main contribution, 
we have presented algorithms for model selection in several scenarios, and the common thread in 
each is that we attain good performance by evaluating only a small and intelligently-selected set 
of models, allocating samples to each model based on computational cost. For model selection 
over nested hierarchies, this takes the form of a new estimator based on a coarse gridding of the 
model space, which is competitive (up to logarithmic factors) with an omniscient oracle. A minor 
extension of our algorithm is adaptive to problem complexity, since it yields fast rates for model 
selection when the underlying estimation problems have appropriate curvature or low-noise prop- 
erties. We also presented an exploration-exploitation algorithm for model selection in unstructured 
cases, showing that it obtains (in some sense) nearly optimal performance. 

There are certainly many possible extensions and open questions that our work raises. We 
address the setting where the complexity penalties are known and can be computed easily in 
closed form. Often it is desirable to use data-dependent penalties [?], since they adapt
to the particular problem instance and data distribution. It appears to be somewhat difficult to 
extend such penalties to the procedures we have developed in this paper, but we believe it would 
be quite interesting. Another natural question to ask is whether there exist intermediate model 
selection problems between a nested sequence of classes and a completely unstructured collection. 
Identifying other structures — and obtaining the corresponding oracle inequalities and understanding 
their dependence on computation — would be an interesting extension of the results presented here. 

More broadly, we believe the idea of using computation, in addition to the number of samples available for a statistical inference problem, to measure the performance of statistical procedures is appealing for a much broader class of problems. In large data settings, one would hope that more data would always improve the risk performance of statistical procedures, even with a fixed computational budget. We hope that extending these ideas to other problems, and understanding how computation interacts with and affects the quality of statistical estimation more generally, will be quite fruitful.
A Auxiliary results for Theorem 1 and Corollary 1

We start by establishing Lemma 2. To prove the lemma, we first need a simple claim.

Lemma 6. Let $c_1 > c_2 > 0$, $s > 0$, and define

$$ i_1 = \operatorname{argmin}_{i=1,2,3,\ldots} \left\{ R_i^* + c_1 \gamma_i\left(\frac{T}{s}\right) + \kappa_2\sqrt{\frac{2(m + \log s)}{n_i(T/s)}} \right\}, $$

$$ i_2 = \operatorname{argmin}_{i=1,2,3,\ldots} \left\{ R_i^* + c_2 \gamma_i\left(\frac{T}{s}\right) + \kappa_2\sqrt{\frac{2(m + \log s)}{n_i(T/s)}} \right\}. $$

Then under the monotonicity assumptions B, we have $i_1 \le i_2$.

Proof. Recall the shorthand definition (8) of $\bar{\gamma}_i$. Under the monotonicity assumptions B(a)-(b), $\bar{\gamma}_i$ is monotone increasing in $i$. By the definitions of $i_1$ and $i_2$ we have

$$ R_{i_1}^* + c_1 \bar{\gamma}_{i_1}(T, s) \le R_{i_2}^* + c_1 \bar{\gamma}_{i_2}(T, s) \quad \text{and} \quad R_{i_2}^* + c_2 \bar{\gamma}_{i_2}(T, s) \le R_{i_1}^* + c_2 \bar{\gamma}_{i_1}(T, s). $$

Adding the two inequalities we obtain

$$ (c_1 - c_2)\, \bar{\gamma}_{i_1}(T, s) \le (c_1 - c_2)\, \bar{\gamma}_{i_2}(T, s). $$

Since $c_1 - c_2 > 0$ by assumption, the monotonicity of $\bar{\gamma}_i$ guarantees $i_1 \le i_2$. □
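The monotonicity claim of Lemma 6 is easy to test empirically. The sketch below is a simplified illustration: it collapses the penalty into a single strictly increasing sequence `penalties`, uses integer inputs so that all comparisons are exact, and draws random instances; every name and value is hypothetical.

```python
import random

def penalized_argmin(risks, penalties, c):
    # Smallest index minimizing risks[i] + c * penalties[i]
    values = [r + c * g for r, g in zip(risks, penalties)]
    return values.index(min(values))

def check_lemma6(trials=500, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        k = rng.randint(2, 8)
        risks = [rng.randint(0, 50) for _ in range(k)]
        penalties = sorted(rng.sample(range(1, 100), k))  # strictly increasing in i
        i1 = penalized_argmin(risks, penalties, c=3)  # larger multiplier c_1
        i2 = penalized_argmin(risks, penalties, c=1)  # smaller multiplier c_2
        if i1 > i2:
            return False
    return True

lemma6_holds = check_lemma6()
```

A heavier penalty multiplier can only push the minimizer toward simpler classes, which is exactly what the proof's addition of the two optimality inequalities shows.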

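Proof of Lemma 2. Lemma 6 allows us to establish a simpler version of Lemma 2. Since $1 + \lambda > 1$, it suffices to establish $i_0 \le K(\lambda)$, where

$$ i_0 = \operatorname{argmin}_{i=1,2,3,\ldots} \left\{ R_i^* + \gamma_i\left(\frac{T}{s(\lambda)}\right) + \kappa_2\sqrt{\frac{2(m + \log s(\lambda))}{n_i(T/s(\lambda))}} \right\}. $$

Let $\bar{\gamma}_i$ be shorthand for the quantity (8) as usual. Recalling the construction of $S_\lambda$ in (15), we observe that any class $i > K(\lambda)$ satisfies

$$ (1 + \lambda)^{s(\lambda) - 2}\, \bar{\gamma}_1(T, s(\lambda)) < \bar{\gamma}_i(T, s(\lambda)). $$

The setting (13) of $s(\lambda)$ ensures that

$$ (1 + \lambda)^{s(\lambda) - 2} \ge (1 + \lambda)^{\lceil \log(1 + B/\bar{\gamma}_1(T, s(\lambda))) / \log(1 + \lambda) \rceil} \ge \exp\left( \log\left( 1 + \frac{B}{\bar{\gamma}_1(T, s(\lambda))} \right) \right), $$

so that

$$ (1 + \lambda)^{s(\lambda) - 2}\, \bar{\gamma}_1(T, s(\lambda)) \ge B + \bar{\gamma}_1(T, s(\lambda)) \ge R_1^* + \bar{\gamma}_1(T, s(\lambda)) \ge \inf_{i} \big\{ R_i^* + \bar{\gamma}_i(T, s(\lambda)) \big\}. $$

Hence we observe that for $i > K(\lambda)$,

$$ R_i^* + \bar{\gamma}_i(T, s(\lambda)) \ge \bar{\gamma}_i(T, s(\lambda)) > (1 + \lambda)^{s(\lambda) - 2}\, \bar{\gamma}_1(T, s(\lambda)) \ge \inf_{i} \big\{ R_i^* + \bar{\gamma}_i(T, s(\lambda)) \big\}. $$

We must thus have $i_0 \le K(\lambda)$, and Lemma 6 further implies that $i^* \le K(\lambda)$. □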

We finally provide a proof for Proposition [U 
Proof of Proposition [TJ Since for any a, b > 0, y^a + \fb < y/2(a + 6), it suffices to control 
the probability of the event 



m > mm {«• + 2 7i (^) + ^^|y + ^H^JA))}- <32> 
For the event (1321) to occur, at least one of 



I ^ ?\ ( T \ K2 I J7l K2 / logs(A) 

R(f) > mm i? ni(T/s(A)) (/ i ) + 7i + Y ]/ ni (T/s(X)) + TV n,(T/ S (A)) (33&) 



or 



mm 



R MT/s W) (fi) + n [-^) + yy ni(T/s(A)) + 



, g5 | wswn , t , ll \s(\)J 2 y m(T/s(\)) 2 V ni(T/s(A)) J 

> ^{ fl|+27i (^) +K2 /5liy +K2 ^S} (33b) 

must occur. We bound the probabilities of the events (|33aj) and (|33bj) in turn. 

If the event (|33ap occurs, by definition of the selection strategy (|10p . it must be the case that 
for some i £ S (namely i = i) 



, ~ . f T \ k 2 I m /«2 / logs(A) 

R(fi) > R ni (T/s(x))(fi)+Ji [^) + Y\j ni (T/s(\)) + Y 



s(A)7 2Vn i (T/s(A)) 2 V m(T/s(X)) 
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since the chosen / minimizes the right side of this display over the classes T\ for i £ S. By a union 
bound, we see that 



R(f) > mm I R ni (T/ s (X)){fi) + 7i 

I 



<P 3iG5s.t. R(fi)> R(fi)+7i 



T \ K 



+ 



k 2 / log s(A) 



s(A); 2 A/ ni{T/s{\)) 2Vn,(T/ S (A)) 



??? 



K2 / fogs(A) 



s(X) ; 2 V n,(T/s(A)) 2 V n;(T/ S (A)) 



< Ki eX P ( — m ~~ log S W) = K l ex P(" 



■m) 



where the final inequality follows from Assumption ICl 

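Now we bound the probability of the event (33b), noting that the event implies that

$$\max_{i \in S} \left\{ \hat{R}_{n_i(T/s(\lambda))}(f_i) - R_i^* - \gamma_i\left(\frac{T}{s(\lambda)}\right) - \frac{\kappa_2}{2} \sqrt{\frac{\log s(\lambda)}{n_i(T/s(\lambda))}} - \frac{\kappa_2}{2} \sqrt{\frac{m}{n_i(T/s(\lambda))}} \right\} > 0.$$

We can thus apply a union bound to see that the probability of the event (33b) is bounded by

$$\begin{aligned}
&\mathbb{P}\left( \max_{i \in S} \left\{ \hat{R}_{n_i(T/s(\lambda))}(f_i) - R_i^* - \gamma_i\left(\frac{T}{s(\lambda)}\right) - \frac{\kappa_2}{2} \sqrt{\frac{\log s(\lambda)}{n_i(T/s(\lambda))}} - \frac{\kappa_2}{2} \sqrt{\frac{m}{n_i(T/s(\lambda))}} \right\} > 0 \right) \\
&\quad \le \sum_{i \in S} \mathbb{P}\left( \hat{R}_{n_i(T/s(\lambda))}(f_i) - R_i^* - \gamma_i\left(\frac{T}{s(\lambda)}\right) > \frac{\kappa_2}{2} \sqrt{\frac{\log s(\lambda)}{n_i(T/s(\lambda))}} + \frac{\kappa_2}{2} \sqrt{\frac{m}{n_i(T/s(\lambda))}} \right) \\
&\quad \le \sum_{i \in S} \mathbb{P}\left( \hat{R}_{n_i(T/s(\lambda))}(f_i^*) - R_i^* > \frac{\kappa_2}{2} \sqrt{\frac{\log s(\lambda)}{n_i(T/s(\lambda))}} + \frac{\kappa_2}{2} \sqrt{\frac{m}{n_i(T/s(\lambda))}} \right),
\end{aligned} \tag{34}$$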



where the final inequality uses Assumption B(d), which states that $\mathcal{A}$ outputs a $\gamma_i$-minimizer of the empirical risk. Now we can bound the deviations using the second part of Assumption C since $f_i^*$ is non-random: the quantity (34) is bounded by

$$\sum_{i \in S} \kappa_1 \exp\left( -n_i(T/s(\lambda)) \left[ \frac{\log s(\lambda)}{n_i(T/s(\lambda))} + \frac{m}{n_i(T/s(\lambda))} \right] \right) \le \kappa_1 \exp(-m).$$



Combining the two events (33a) and (33b) completes the proof of the proposition. □
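The union-bound arithmetic in the proof above can be checked numerically. The following is a minimal sketch, assuming the deviation bound has the sub-Gaussian form $\kappa_1 \exp(-4 n \epsilon^2)$ used in this appendix and that the coarse grid contains $s(\lambda)$ classes. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
import math

def deviation_scale(m, s, n):
    """epsilon = (sqrt(m / n) + sqrt(log(s) / n)) / 2, the deviation scale in (33a)."""
    return 0.5 * (math.sqrt(m / n) + math.sqrt(math.log(s) / n))

def per_class_failure(kappa1, m, s, n):
    """Tail bound kappa1 * exp(-4 n eps^2) for a single class with n samples."""
    eps = deviation_scale(m, s, n)
    return kappa1 * math.exp(-4.0 * n * eps ** 2)

def union_bound(kappa1, m, s, sample_sizes):
    """Sum of the per-class tail bounds over the classes in the grid."""
    return sum(per_class_failure(kappa1, m, s, n) for n in sample_sizes)

if __name__ == "__main__":
    kappa1, m, s = 2.0, 3.0, 8                      # illustrative constants
    sample_sizes = [1000, 800, 600, 400, 300, 200, 150, 100]  # one n_i per class
    total = union_bound(kappa1, m, s, sample_sizes)
    # The total should not exceed kappa1 * exp(-m).
    print(total, kappa1 * math.exp(-m))
```

Because $4 n \epsilon^2 = (\sqrt{m} + \sqrt{\log s})^2 \ge m + \log s$ regardless of $n$, each class contributes at most $\kappa_1 e^{-m}/s$, so the sum over $s$ classes is at most $\kappa_1 e^{-m}$.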



B Auxiliary results for Theorem 2

Proof of Lemma 3. In the proof of the lemma, assume that both of the events (24) hold. Recall that we define $f_j = \mathcal{A}(j, n_j)$, so that by the definition (23a) and Assumption B that $f_j$ is a $\gamma_j$-accurate minimizer of the empirical risk, we have

$$R(f_j) \le R(f_j^*) + 3\gamma_j(n_j) + \kappa_2 \epsilon_j \tag{35}$$
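for any $j$. By our assumption that the index $j < \hat{i}$, we have $f_j \in \mathcal{F}_{\hat{i}}$, and since the event (23b) holds for the classes $\hat{i}$ and $j$ (i.e. $\mathcal{E}_2(\epsilon)$ occurs), we further obtain that

$$\hat{R}_{\hat{i}}(f_j) - \hat{R}_{\hat{i}}(f_j^*) \le 2\left(R(f_j) - R(f_j^*)\right) + \gamma_j(n_{\hat{i}}) + \kappa_2 \epsilon_{\hat{i}}. \tag{36}$$

Applying the earlier bound (35) on $R(f_j) - R(f_j^*)$ to the inequality (36), we see that

$$\hat{R}_{\hat{i}}(f_j) - \hat{R}_{\hat{i}}(f_j^*) \le 6\gamma_j(n_j) + 2\kappa_2 \epsilon_j + \gamma_j(n_{\hat{i}}) + \kappa_2 \epsilon_{\hat{i}}. \tag{37}$$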

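Now we again use the fact that the event (23b) holds so that $\mathcal{E}_2(\epsilon)$ occurs. Using $f = f_j^*$ in the event since $f_j^* \in \mathcal{F}_{\hat{i}}$, we see that

$$2\left(R(f_j^*) - R(f_{\hat{i}}^*)\right) \ge \left(\hat{R}_{\hat{i}}(f_j^*) - \hat{R}_{\hat{i}}(f_{\hat{i}}^*)\right) - \gamma_{\hat{i}}(n_{\hat{i}}) - \kappa_2 \epsilon_{\hat{i}}.$$

Now apply the inequality (37) to lower bound $\hat{R}_{\hat{i}}(f_j^*)$ to see that

$$\begin{aligned}
2\left(R(f_j^*) - R(f_{\hat{i}}^*)\right) &\ge \hat{R}_{\hat{i}}(f_j) - \hat{R}_{\hat{i}}(f_{\hat{i}}^*) - 6\gamma_j(n_j) - 2\kappa_2 \epsilon_j - \gamma_j(n_{\hat{i}}) - \gamma_{\hat{i}}(n_{\hat{i}}) - 2\kappa_2 \epsilon_{\hat{i}} \\
&\ge \hat{R}_{\hat{i}}(f_j) - \hat{R}_{\hat{i}}(f_{\hat{i}}^*) - 6\gamma_j(n_j) - 2\kappa_2 \epsilon_j - 2\gamma_{\hat{i}}(n_{\hat{i}}) - 2\kappa_2 \epsilon_{\hat{i}},
\end{aligned}$$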

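where we have used the fact that $j < \hat{i}$ so $\gamma_{\hat{i}}(n_{\hat{i}}) \ge \gamma_j(n_{\hat{i}})$. Using the condition (25) that defines the selected index $\hat{i}$, we obtain

$$\begin{aligned}
2\left(R(f_j^*) - R(f_{\hat{i}}^*)\right) &\ge \hat{R}_{\hat{i}}(f_{\hat{i}}) + c_1 \gamma_{\hat{i}}(n_{\hat{i}}) + c_2 \kappa_2 \epsilon_{\hat{i}} - c_1 \gamma_j(n_j) - \hat{R}_{\hat{i}}(f_{\hat{i}}^*) - 6\gamma_j(n_j) - 2\kappa_2 \epsilon_j - 2\gamma_{\hat{i}}(n_{\hat{i}}) - 2\kappa_2 \epsilon_{\hat{i}} \\
&= \hat{R}_{\hat{i}}(f_{\hat{i}}) - \hat{R}_{\hat{i}}(f_{\hat{i}}^*) + (c_1 - 2)\gamma_{\hat{i}}(n_{\hat{i}}) - (6 + c_1)\gamma_j(n_j) - 2\kappa_2 \epsilon_j + (c_2 - 2)\kappa_2 \epsilon_{\hat{i}}.
\end{aligned}$$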

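Finally, we note that by the event (23a), since $R(f_{\hat{i}}^*) - R(f) \le 0$ for all $f \in \mathcal{F}_{\hat{i}}$, we have

$$\hat{R}_{\hat{i}}(f_{\hat{i}}^*) \le \hat{R}_{\hat{i}}(f_{\hat{i}}) + \frac{1}{2}\gamma_{\hat{i}}(n_{\hat{i}}) + \frac{1}{2}\kappa_2 \epsilon_{\hat{i}},$$

whence we obtain

$$2\left(R(f_j^*) - R(f_{\hat{i}}^*)\right) \ge (c_1 - 5/2)\gamma_{\hat{i}}(n_{\hat{i}}) - (6 + c_1)\gamma_j(n_j) - 2\kappa_2 \epsilon_j + (c_2 - 5/2)\kappa_2 \epsilon_{\hat{i}}. \tag{38}$$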

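Applying the inequality (35) for the class $\hat{i}$, we have

$$R(f_j^*) - R(f_{\hat{i}}) \ge R(f_j^*) - R(f_{\hat{i}}^*) - 3\gamma_{\hat{i}}(n_{\hat{i}}) - \kappa_2 \epsilon_{\hat{i}},$$

and combining this inequality with the earlier guarantee (38), we find that

$$2\left(R(f_j^*) - R(f_{\hat{i}})\right) \ge (c_1 - 17/2)\gamma_{\hat{i}}(n_{\hat{i}}) - (6 + c_1)\gamma_j(n_j) - 2\kappa_2 \epsilon_j + (c_2 - 9/2)\kappa_2 \epsilon_{\hat{i}}.$$

Rearranging terms, we obtain the statement of the lemma. □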

In order to prove Lemma 4 we need one more result:

Lemma 7. Let the joint events (24) hold (i.e. $\mathcal{E}_1(\epsilon)$ and $\mathcal{E}_2(\epsilon)$). For $i, j \in S_\lambda$ such that $i > j$ and

$$\hat{R}_i(f_j) + c_1 \gamma_j(n_j) \le \hat{R}_i(f_i) + c_1 \gamma_i(n_i) + c_2 \kappa_2 \epsilon_i,$$

we have

$$R(f_j) \le R(f_i^*) + (2c_1 + 3)\gamma_i(n_i) + (2c_2 + 1)\kappa_2 \epsilon_i.$$

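Proof. We begin by noting that since $i > j$, we have $f_j \in \mathcal{F}_i$, and since the event (23a) holds by assumption, we have

$$R(f_j) - R(f_i^*) \le 2\left(\hat{R}_i(f_j) - \hat{R}_i(f_i^*)\right) + \gamma_i(n_i) + \kappa_2 \epsilon_i.$$

Recalling the inequality assumed in the condition of the lemma, we see that

$$R(f_j) - R(f_i^*) \le 2\left(\hat{R}_i(f_i) + c_1 \gamma_i(n_i) + c_2 \kappa_2 \epsilon_i - c_1 \gamma_j(n_j) - \hat{R}_i(f_i^*)\right) + \gamma_i(n_i) + \kappa_2 \epsilon_i.$$

Applying Assumption B(d) on the empirical minimizers, we have $\hat{R}_i(f_i) - \hat{R}_i(f_i^*) \le \gamma_i(n_i)$, so

$$R(f_j) - R(f_i^*) \le 2\left((c_1 + 1)\gamma_i(n_i) + c_2 \kappa_2 \epsilon_i - c_1 \gamma_j(n_j)\right) + \gamma_i(n_i) + \kappa_2 \epsilon_i.$$

Ignoring the negative term $-c_1 \gamma_j(n_j)$ yields the lemma. □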

Proof of Lemma 4. For $j \in S_\lambda$, define $S_\lambda(j)$ to be the position of class $j$ in the coarse-grid set (that is, $S_\lambda(1) = 1$, the next class $j \in S_\lambda$ has $S_\lambda(j) = 2$ and so on). We prove the lemma by induction on the class $j$ for $j \ge \hat{i}$, $j \in S_\lambda$. Our inductive hypothesis is that

$$R(f_{\hat{i}}) \le R(f_j^*) + \left(S_\lambda(j) - S_\lambda(\hat{i}) + 1\right)\left[(2c_1 + 3)\gamma_j(n_j) + (2c_2 + 1)\kappa_2 \epsilon_j\right]. \tag{39}$$

The base case for $j = \hat{i}$ is immediate since by assumption, the event (23a) holds, so we obtain the inequality (35).

For the inductive step, we assume that the claim holds for all $\hat{i} \le k \le j - 1$ such that $k \in S_\lambda$ and establish the claim for $j$. Since $\hat{i}$ is the largest class in $S_\lambda$ satisfying the condition (25) and $j > \hat{i}$, there must exist a class $k < j$ in $S_\lambda$ for which

$$\hat{R}_j(f_k) + c_1 \gamma_k(n_k) \le \hat{R}_j(f_j) + c_1 \gamma_j(n_j) + c_2 \kappa_2 \epsilon_j. \tag{40}$$

By inspection, this is precisely the condition of Lemma 7, so

$$R(f_k^*) \le R(f_k) \le R(f_j^*) + (2c_1 + 3)\gamma_j(n_j) + (2c_2 + 1)\kappa_2 \epsilon_j.$$

Now there are two possibilities. If $k < \hat{i}$, Lemma 3 applies, and we recall the assumptions on $c_1$ and $c_2$, which guarantee $2c_1 + 3 \ge 6 + c_1$ and $2c_2 + 1 \ge 2$. If $k \ge \hat{i}$, then we can apply our inductive hypothesis since $k < j$. In either case, we conclude that

$$\begin{aligned}
R(f_{\hat{i}}) &\le R(f_k^*) + \left(S_\lambda(k) - S_\lambda(\hat{i}) + 1\right)\left[(2c_1 + 3)\gamma_k(n_k) + (2c_2 + 1)\kappa_2 \epsilon_k\right] \\
&\le R(f_k^*) + \left(S_\lambda(j) - 1 - S_\lambda(\hat{i}) + 1\right)\left[(2c_1 + 3)\gamma_j(n_j) + (2c_2 + 1)\kappa_2 \epsilon_j\right],
\end{aligned}$$

where the final inequality uses $S_\lambda(k) \le S_\lambda(j) - 1$ and the monotonicity assumptions B(a)-(b). Applying the relationship (40) of the risk of $f_k^*$ to that of $f_j^*$ shows that the inductive hypothesis (39) holds at $j$. Noting that $s(\lambda) \ge S_\lambda(j) - S_\lambda(\hat{i}) + 1$ completes the proof. □
C Proof of Lemma 5

Following 0] , we show that the event in the lemma occurs with very low probability by breaking it 
up into smaller events more amenable to analysis. Recall that we are interested in controlling the 
probability of the event 



$$R(i, n_i s_i) - \kappa_2 \sqrt{\frac{\log T}{n_i s_i}} < R(i^*, n_{i^*} s_{i^*}) - \kappa_2 \sqrt{\frac{\log T}{n_{i^*} s_{i^*}}}. \tag{41}$$

For this bad event to happen, at least one of the following three events must happen: 



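$$\hat{R}_{n_i s_i}(\mathcal{A}(i, n_i s_i)) - \inf_{f \in \mathcal{F}_i} R(f) < -\gamma_i(n_i s_i) - \kappa_2 \sqrt{\frac{\log K}{n_i s_i}} - \kappa_2 \sqrt{\frac{\log T}{n_i s_i}} \tag{42a}$$

$$\hat{R}_{n_{i^*} s_{i^*}}(\mathcal{A}(i^*, n_{i^*} s_{i^*})) - \inf_{f \in \mathcal{F}_{i^*}} R(f) > \gamma_{i^*}(n_{i^*} s_{i^*}) + \kappa_2 \sqrt{\frac{\log K}{n_{i^*} s_{i^*}}} + \kappa_2 \sqrt{\frac{\log T}{n_{i^*} s_{i^*}}} \tag{42b}$$

$$R_i^* + \gamma_i(T n_i) < R_{i^*}^* + \gamma_{i^*}(T n_{i^*}) + 2\left(\gamma_i(n_i s_i) + \kappa_2 \sqrt{\frac{\log K}{n_i s_i}} + \kappa_2 \sqrt{\frac{\log T}{n_i s_i}}\right). \tag{42c}$$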

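Temporarily use the shorthand $f_i = \mathcal{A}(i, n_i s_i)$ and $f_{i^*} = \mathcal{A}(i^*, n_{i^*} s_{i^*})$. The relationship between Eqs. (42a)-(42c) and the event in (41) follows from the fact that if none of (42a)-(42c) occur, then

$$\begin{aligned}
R(i, n_i s_i) - \kappa_2 \sqrt{\frac{\log T}{n_i s_i}}
&= \hat{R}_{n_i s_i}(f_i) + \gamma_i(T n_i) - \gamma_i(n_i s_i) - \kappa_2 \sqrt{\frac{\log K}{n_i s_i}} - \kappa_2 \sqrt{\frac{\log T}{n_i s_i}} \\
&\overset{(42a)}{\ge} \inf_{f \in \mathcal{F}_i} R(f) + \gamma_i(T n_i) - 2\left(\gamma_i(n_i s_i) + \kappa_2 \sqrt{\frac{\log K}{n_i s_i}} + \kappa_2 \sqrt{\frac{\log T}{n_i s_i}}\right) \\
&\overset{(42c)}{\ge} \inf_{f \in \mathcal{F}_{i^*}} R(f) + \gamma_{i^*}(T n_{i^*}) + 2\left(\gamma_i(n_i s_i) + \kappa_2 \sqrt{\frac{\log K}{n_i s_i}} + \kappa_2 \sqrt{\frac{\log T}{n_i s_i}}\right) \\
&\qquad - 2\left(\gamma_i(n_i s_i) + \kappa_2 \sqrt{\frac{\log K}{n_i s_i}} + \kappa_2 \sqrt{\frac{\log T}{n_i s_i}}\right) \\
&\overset{(42b)}{\ge} \hat{R}_{n_{i^*} s_{i^*}}(f_{i^*}) + \gamma_{i^*}(T n_{i^*}) - \gamma_{i^*}(n_{i^*} s_{i^*}) - \kappa_2 \sqrt{\frac{\log K}{n_{i^*} s_{i^*}}} - \kappa_2 \sqrt{\frac{\log T}{n_{i^*} s_{i^*}}} \\
&= R(i^*, n_{i^*} s_{i^*}) - \kappa_2 \sqrt{\frac{\log T}{n_{i^*} s_{i^*}}}.
\end{aligned}$$

From the above string of inequalities, to show that the event (41) has low probability, we need simply show that each of (42a), (42b), and (42c) has low probability.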

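To prove that each of the bad events has low probability, we note the following consequences of Assumption C. Recall the definition of $f_i^*$ as the minimizer of $R(f)$ over the class $\mathcal{F}_i$. Then by Assumption C(b),

$$R(f_i^*) - \gamma_i(n) - \kappa_2 \epsilon \le R(\mathcal{A}(i, n)) - \gamma_i(n) - \kappa_2 \epsilon \le \hat{R}_n(\mathcal{A}(i, n)),$$

while Assumptions C(c) and C(e) imply

$$\hat{R}_n(\mathcal{A}(i, n)) \le \hat{R}_n(f_i^*) + \gamma_i(n) \le R(f_i^*) + \gamma_i(n) + \kappa_2 \epsilon,$$

each with probability at least $1 - \kappa_1 \exp(-4n\epsilon^2)$. In particular, we see that the events (42a) and (42b) have low probability:

$$\begin{aligned}
&\mathbb{P}\left[ \hat{R}_{n_i s_i}(\mathcal{A}(i, n_i s_i)) - R(f_i^*) < -\gamma_i(n_i s_i) - \kappa_2 \sqrt{\frac{\log K}{n_i s_i}} - \kappa_2 \sqrt{\frac{\log T}{n_i s_i}} \right] \\
&\quad \le \kappa_1 \exp\left( -4 n_i s_i \left[ \frac{\log K}{n_i s_i} + \frac{\log T}{n_i s_i} \right] \right) = \frac{\kappa_1}{(TK)^4},
\end{aligned}$$

$$\begin{aligned}
&\mathbb{P}\left[ \hat{R}_{n_{i^*} s_{i^*}}(\mathcal{A}(i^*, n_{i^*} s_{i^*})) - R_{i^*}^* > \gamma_{i^*}(n_{i^*} s_{i^*}) + \kappa_2 \sqrt{\frac{\log K}{n_{i^*} s_{i^*}}} + \kappa_2 \sqrt{\frac{\log T}{n_{i^*} s_{i^*}}} \right] \\
&\quad \le \kappa_1 \exp\left( -4 n_{i^*} s_{i^*} \left[ \frac{\log K}{n_{i^*} s_{i^*}} + \frac{\log T}{n_{i^*} s_{i^*}} \right] \right) = \frac{\kappa_1}{(TK)^4}.
\end{aligned}$$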

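What remains is to show that for large enough $\tau$, (42c) does not happen. Recalling the definition that $R_{i^*}^* + \gamma_{i^*}(T n_{i^*}) = R_i^* + \gamma_i(T n_i) - \Delta_i$, we see that for (42c) to fail it is sufficient that

$$\Delta_i > 2\gamma_i(\tau n_i) + 2\kappa_2 \sqrt{\frac{\log K}{n_i \tau}} + 2\kappa_2 \sqrt{\frac{\log T}{n_i \tau}}.$$

Let $x \wedge y := \min\{x, y\}$ and $x \vee y := \max\{x, y\}$. Since $\gamma_i(n) \le c_i n^{-\alpha_i}$, the above is satisfied when

$$\Delta_i > 2\left(c_i + \kappa_2 \sqrt{\log T} + \kappa_2 \sqrt{\log K}\right)(\tau n_i)^{-(\alpha_i \wedge \frac{1}{2})}. \tag{43}$$

We can solve (43) above and see immediately that if

$$\tau_i > \frac{2^{1/\alpha_i \vee 2}\left(c_i + \kappa_2 \sqrt{\log T} + \kappa_2 \sqrt{\log K}\right)^{1/\alpha_i \vee 2}}{n_i \Delta_i^{1/\alpha_i \vee 2}}$$

then

$$R_i^* + \gamma_i(T n_i) > R_{i^*}^* + \gamma_{i^*}(T n_{i^*}) + 2\left(\gamma_i(n_i \tau_i) + \kappa_2 \sqrt{\frac{\log K}{n_i \tau_i}} + \kappa_2 \sqrt{\frac{\log T}{n_i \tau_i}}\right). \tag{44}$$

Thus the event in (42c) fails to occur, completing the proof of the lemma.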
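The threshold on the number of pulls can be illustrated with a small computation. This is a sketch under stated assumptions, not the paper's procedure: it assumes a penalty of the form $\gamma_i(n) = c_i n^{-\alpha_i}$, assumes $n_i \tau \ge 1$, and uses hypothetical function names and parameter values.

```python
import math

def tau_threshold(delta, c, alpha, kappa2, n, T, K):
    """Threshold (2b)^beta / (n * delta^beta) with beta = max(1/alpha, 2)
    and b = c + kappa2*sqrt(log T) + kappa2*sqrt(log K)."""
    beta = max(1.0 / alpha, 2.0)
    b = c + kappa2 * math.sqrt(math.log(T)) + kappa2 * math.sqrt(math.log(K))
    return (2.0 * b) ** beta / (n * delta ** beta)

def doubled_penalty(tau, c, alpha, kappa2, n, T, K):
    """2 * (gamma(n*tau) + kappa2*sqrt(log K/(n*tau)) + kappa2*sqrt(log T/(n*tau)))
    for the assumed penalty gamma(n) = c * n**(-alpha)."""
    m = n * tau
    return 2.0 * (c * m ** (-alpha)
                  + kappa2 * math.sqrt(math.log(K) / m)
                  + kappa2 * math.sqrt(math.log(T) / m))

if __name__ == "__main__":
    params = dict(c=1.0, alpha=0.3, kappa2=1.0, n=10, T=1000, K=5)
    delta = 0.5                                   # illustrative excess penalized risk
    tau = 1.01 * tau_threshold(delta, **params)   # slightly above the threshold
    # Above the threshold, the doubled penalty should fall below delta.
    print(tau, doubled_penalty(tau, **params), delta)
```

The check relies on the fact that, for $n_i \tau \ge 1$, both the $(n_i\tau)^{-\alpha_i}$ term and the $(n_i\tau)^{-1/2}$ terms are at most $(n_i\tau)^{-\min\{\alpha_i, 1/2\}}$.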



D Proofs of Proposition 2 and Theorem 4

In this section we provide proofs for Proposition 2 and Theorem 4. The proof of the proposition follows by dividing the model classes into two groups: those for which $\Delta_i > \gamma$, and those with small excess risk, i.e. $\Delta_i \le \gamma$. Theorem 3 provides an upper bound on the fraction of budget allocated to model classes of the first type. For the model classes with small excess risk, all of them are nearly as good as $i^*$ in the regret criterion of Proposition 2. Combining the two arguments gives us the desired result.

Of course, the proposition has the drawback that it does not provide us with a prescription to select a good model or even a model class. This shortcoming is addressed by Theorem 4. The theorem relies on an averaging argument used quite frequently to extract a good solution out of online learning or stochastic optimization algorithms [14, 24].

D.1 Proof of Proposition 2

Define $\beta_i = \max\{1/\alpha_i, 2\}$ as in the conclusion of Theorem 3, and let $b_i = c_i + \kappa_2 \sqrt{\log T}$. Dividing the regret into classes with high and low excess penalized risk $\Delta_i$, for any threshold $\gamma > 0$ we have by a union bound that with probability at least $1 - \kappa_1/(TK^3)$,

k 

^A i T i (T)= Yl A i T i( T )+ E AiT ^ 
i=l {j|Ai> 7 } {j|Ai< 7 } 

ifii K ift 

To simplify this further, we use the assumption that $\alpha_i = \alpha$ for all $i$. Hence the complexity penalties of the classes differ only in the sampling rates $n_i$, that is,

- 1 K Ch^ 

E A ^( T )^^iE^+^- ( 45 ) 
1=1 1 1=1 * 

Minimizing the bound (45) over $\gamma$ by taking derivatives, we get



which, when plugged back into (45), gives

i=i Vi=i ni / 

Noting that $\frac{1}{\beta}\log(\beta - 1) \le 1$, we see that $(\beta - 1)^{1/\beta} \le \exp(1)$. Plugging in the definition of $\beta = \max\{1/\alpha, 2\}$, so that $1/\beta = \min\{\alpha, \frac{1}{2}\}$, gives the result of the proposition.

D.2 Proof of Theorem 4

Before proving the theorem, we state a technical lemma that makes our argument somewhat simpler.

Lemma 8. For $0 < p < 1$ and $a \succ 0$, consider the optimization problem

$$\max_x \; \sum_{i=1}^K a_i x_i^p \quad \text{s.t.} \quad \sum_{i=1}^K x_i \le T, \quad x_i \ge 0.$$

The solution of the problem is to take $x_i \propto a_i^{1/(1-p)}$, and the optimal value is

$$T^p \left( \sum_{i=1}^K a_i^{1/(1-p)} \right)^{1-p}.$$
Proof. Reformulating the problem to make it a minimization problem, that is, our objective is $-\sum_{i=1}^K a_i x_i^p$, we have a convex problem. Introducing Lagrange multipliers $\theta \ge 0$ and $\nu \in \mathbb{R}_+^K$ for the inequality constraints, we have Lagrangian

$$\mathcal{L}(x, \theta, \nu) = -\sum_{i=1}^K a_i x_i^p + \theta \left( \sum_{i=1}^K x_i - T \right) - \langle \nu, x \rangle.$$

To find the infimum of the Lagrangian over $x$, we take derivatives and see that $-a_i p x_i^{p-1} + \theta - \nu_i = 0$, or that $x_i = a_i^{1/(1-p)} p^{1/(1-p)} (\theta - \nu_i)^{1/(p-1)}$. Since $a_i > 0$, the complementary slackness conditions for $\nu$ are satisfied with $\nu = 0$, and we see that $\theta$ is simply a multiplier to force the sum $\sum_{i=1}^K x_i = T$. That is, $x_i \propto a_i^{1/(1-p)}$, and normalizing appropriately, $x_i = T a_i^{1/(1-p)} / \sum_{j=1}^K a_j^{1/(1-p)}$. By plugging $x_i$ into the objective, we have

$$\sum_{i=1}^K a_i x_i^p = T^p \, \frac{\sum_{i=1}^K a_i \, a_i^{p/(1-p)}}{\left( \sum_{j=1}^K a_j^{1/(1-p)} \right)^p} = T^p \, \frac{\sum_{i=1}^K a_i^{1/(1-p)}}{\left( \sum_{j=1}^K a_j^{1/(1-p)} \right)^p} = T^p \left( \sum_{i=1}^K a_i^{1/(1-p)} \right)^{1-p}.$$

□
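The closed form in Lemma 8 is easy to check numerically. The following is a minimal sketch that compares the allocation $x_i \propto a_i^{1/(1-p)}$ against randomly drawn feasible allocations; the function names and the test values are hypothetical and only illustrate the statement.

```python
import random

def closed_form_allocation(a, p, T):
    """x_i = T * a_i^(1/(1-p)) / sum_j a_j^(1/(1-p))."""
    weights = [ai ** (1.0 / (1.0 - p)) for ai in a]
    total = sum(weights)
    return [T * w / total for w in weights]

def objective(a, x, p):
    """sum_i a_i * x_i^p."""
    return sum(ai * xi ** p for ai, xi in zip(a, x))

def closed_form_value(a, p, T):
    """T^p * (sum_i a_i^(1/(1-p)))^(1-p)."""
    return T ** p * sum(ai ** (1.0 / (1.0 - p)) for ai in a) ** (1.0 - p)

def random_feasible_point(k, T, rng):
    """A random nonnegative vector whose entries sum to T."""
    u = [rng.random() + 1e-12 for _ in range(k)]
    s = sum(u)
    return [T * ui / s for ui in u]

if __name__ == "__main__":
    a, p, T = [1.0, 2.0, 3.0], 0.5, 10.0      # illustrative problem data
    x_star = closed_form_allocation(a, p, T)
    best = closed_form_value(a, p, T)
    rng = random.Random(0)
    worst_gap = min(best - objective(a, random_feasible_point(len(a), T, rng), p)
                    for _ in range(1000))
    # The gap should be nonnegative: no random feasible point beats the closed form.
    print(x_star, best, worst_gap)
```

Only allocations that use the full budget are sampled, since the objective is increasing in every coordinate and so the budget constraint is tight at the optimum.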



With Lemma 8 in hand, we proceed with the proof of Theorem 4. As before, we use the shorthand $\beta = \max\{1/\alpha, 2\}$ throughout the proof to reduce clutter. We also let $s_i(t)$ be the number of times class $i$ was selected by time $t$. Recalling the definition of the regret from (29) and the result of the previous proposition, we have with probability at least $1 - \kappa_1/(TK^3)$



1 T f K r\ 

- + lh{Tn it )\ <R* + 7 i* (Tn t ») + 2eK 2 T~ 1 ^ £ _ 

t=i \i=i n V 



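Using the definition of $f_i^*$ as the minimizer of $R(f)$ over $\mathcal{F}_i$, we use Assumptions C(c) and C(e) to see that for fixed $s_i$, with probability at least $1 - \kappa_1/(TK)^4$,

$$\hat{R}_{n_i s_i}(\mathcal{A}(i, n_i s_i)) \le \hat{R}_{n_i s_i}(f_i^*) + \gamma_i(n_i s_i) \le R(f_i^*) + \gamma_i(n_i s_i) + \kappa_2 \sqrt{\frac{\log K}{n_i s_i}} + \kappa_2 \sqrt{\frac{\log T}{n_i s_i}}. \tag{46}$$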



Denote by $f_t$ the output of $\mathcal{A}$ on round $t$, that is, $f_t = \mathcal{A}(i_t, n_{i_t} s_{i_t}(t))$. By the previous equation (46), we can use a union bound and the regret bound from Proposition 2 to conclude that with probability at least $1 - \kappa_1/(TK^3) - \kappa_1/(T^3K^3)$,



1 ~ 

f 2 R n H s H {t){ft) + Hh(Tn it ) 
t=i 

1 T \ f 

7i(«H*it(*)) + «2 



i=l 

r 



T 5-^ 



t=l 



log A 

n k s k W 

log A" 



+ «2< 



/ logT 



1 T 

+ -^[^+7,(^)1 



t=l 



+ K2 



logT 



+ R(f*) + jiimsi 



V/3 



(47) 
Now we again make use of Assumption C(b) to note that with probability at least $1 - \kappa_1/(T^4K^4)$,

$$R(f_t) \le \hat{R}_{n_{i_t} s_{i_t}(t)}(f_t) + \gamma_{i_t}(n_{i_t} s_{i_t}(t)) + \kappa_2 \sqrt{\frac{\log K}{n_{i_t} s_{i_t}(t)}} + \kappa_2 \sqrt{\frac{\log T}{n_{i_t} s_{i_t}(t)}}.$$

Using a union bound and applying the empirical risk bound (47), we drop the positive $\gamma_{i_t}(T n_{i_t})$ terms from the left side of the bound and see that with probability at least $1 - \kappa_1/(TK^3) - 2\kappa_1/(T^3K^3)$,

T / K \ ^Ifi 

\ £ R Ut) <R* + H* ( Tn i* ) + 2e« 2 T- 1 /^^r E - 

t=l \i=l n v 



2^ 



t=\ 



li{n it s it {t)) + k 2 < 



logK 



+ K 2 



logT 



n h s it{t) V n H s it(f) 



(48) 



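Defining $\bar{f}_T := \frac{1}{T} \sum_{t=1}^T f_t$, we use Jensen's inequality to see that $R(\bar{f}_T) \le \frac{1}{T} \sum_{t=1}^T R(f_t)$. Thus, all that remains is to control the last sum in (48). Using the definition of $\gamma_i$, we replace the sum with

$$\sum_{t=1}^T \left[ c_{i_t} \left(n_{i_t} s_{i_t}(t)\right)^{-\alpha} + \kappa_2 \sqrt{\frac{\log K}{n_{i_t} s_{i_t}(t)}} + \kappa_2 \sqrt{\frac{\log T}{n_{i_t} s_{i_t}(t)}} \right].$$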



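Noting that

$$\sum_{t : i_t = i} s_{i_t}(t)^{-\min\{\alpha, \frac{1}{2}\}} = \sum_{\tau = 1}^{T_i(T)} \tau^{-1/\beta} \le C' \, T_i(T)^{1 - 1/\beta}$$

for some constant $C'$ dependent on $\alpha$, we can upper bound the last sum in (48) by

$$\sum_{t=1}^T \left[ \gamma_{i_t}(n_{i_t} s_{i_t}(t)) + \kappa_2 \sqrt{\frac{\log K}{n_{i_t} s_{i_t}(t)}} + \kappa_2 \sqrt{\frac{\log T}{n_{i_t} s_{i_t}(t)}} \right] \le C' \sum_{i=1}^K \left[ c_i n_i^{-\alpha} + \kappa_2 n_i^{-\frac{1}{2}} \sqrt{\log K} + \kappa_2 n_i^{-\frac{1}{2}} \sqrt{\log T} \right] T_i(T)^{1 - 1/\beta}. \tag{49}$$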
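The power-sum bound used to reach (49) can be verified directly. The sketch below uses the integral comparison $\sum_{\tau=1}^{n} \tau^{-1/\beta} \le 1 + \int_1^n x^{-1/\beta}\,dx$, which gives the admissible constant $C' = \beta/(\beta - 1)$ for $\beta \ge 2$. This particular constant is an assumption made for illustration; the text only states that $C'$ depends on $\alpha$.

```python
def power_sum(n, beta):
    """sum_{tau=1}^{n} tau^(-1/beta)."""
    return sum(tau ** (-1.0 / beta) for tau in range(1, n + 1))

def integral_bound(n, beta):
    """(beta / (beta - 1)) * n^(1 - 1/beta), from comparing the sum with an integral."""
    return beta / (beta - 1.0) * n ** (1.0 - 1.0 / beta)

if __name__ == "__main__":
    for beta in (2.0, 3.5, 10.0):          # beta = max(1/alpha, 2) is always >= 2
        for n in (1, 10, 1000):
            print(beta, n, power_sum(n, beta), integral_bound(n, beta))
```

With $\beta \ge 2$ the constant $\beta/(\beta-1)$ is at most 2, so the sum grows like $T_i(T)^{1-1/\beta}$ up to a small factor.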



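Now that we have a sum of order $K$ with terms $T_i(T)$ that are bounded by $T$, that is, $\sum_{i=1}^K T_i(T) = T$, we can apply Lemma 8. Indeed, we set $p = 1 - 1/\beta = 1 - \min\{\alpha, \frac{1}{2}\}$ and $a_i = c_i n_i^{-\alpha} + \kappa_2 n_i^{-\frac{1}{2}} \left[ \sqrt{\log K} + \sqrt{\log T} \right]$ in the lemma, and we see immediately that (49) is upper bounded by

$$C' \, T^{1 - \min\{\alpha, \frac{1}{2}\}} \left( \sum_{i=1}^K \left( c_i n_i^{-\alpha} + \kappa_2 n_i^{-\frac{1}{2}} \sqrt{\log K} + \kappa_2 n_i^{-\frac{1}{2}} \sqrt{\log T} \right)^{\max\{1/\alpha, 2\}} \right)^{\min\{\alpha, \frac{1}{2}\}}.$$

Dividing by $T$ completes the proof that the average $\bar{f}_T$ has good risk properties with probability at least $1 - \kappa_1/(TK^3) - 2\kappa_1/(T^3K^3) \ge 1 - 2\kappa_1/(TK^3)$.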
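The averaging step in the proof above rests on Jensen's inequality: for a loss that is convex in the predictor, the risk of the averaged predictor is at most the average of the individual risks. The following is a minimal sketch on a toy one-dimensional least-squares problem; the data, the scalar predictors, and the function names are hypothetical and stand in for the outputs $f_t$ of the selection procedure.

```python
def squared_loss_risk(w, data):
    """Average of the convex loss (w * x - y)^2 over the data for a scalar predictor w."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def average_predictor(ws):
    """The averaged predictor (1/T) * sum_t w_t."""
    return sum(ws) / len(ws)

if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.2)]   # toy (x, y) pairs
    ws = [1.5, 2.0, 2.6]                           # one predictor per round
    risk_of_average = squared_loss_risk(average_predictor(ws), data)
    average_of_risks = sum(squared_loss_risk(w, data) for w in ws) / len(ws)
    # Jensen's inequality: the first value should not exceed the second.
    print(risk_of_average, average_of_risks)
```

The inequality holds for any convex loss, which is why the bound on the average risk over rounds transfers to the single averaged predictor.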